Your RAG Gets Confidently Wrong as Memory Grows – I Built the Memory Layer That Stops It



TL;DR:

A controlled four-phase experiment in pure Python, with real benchmark numbers. No API key. No GPU. Runs in under 10 seconds.

  • As memory grows from 10 to 500 entries, accuracy drops from 50% to 30%
  • Over the same range, confidence rises from 70.4% to 78.0% — your alerts will never fire
  • The fix is four architectural mechanisms: topic routing, deduplication, relevance eviction, and lexical reranking
  • 50 well-chosen entries outperform 500 accumulated ones. The constraint is the feature.

The Failure That Shouldn’t Have Happened

I ran a controlled experiment on a customer support LLM with long-term memory.

Nothing else changed. Not the model. Not the retrieval pipeline.

At first, it worked perfectly. It answered questions about payment thresholds, password resets, and API rate limits with near-perfect accuracy. Then the system kept running.

Every interaction was stored:

  • meeting notes
  • onboarding checklists
  • internal reminders
  • operational noise

All mixed with the actual answers.

Three months later, a user asked:

“How do I reset a user account password?”

The system responded:

“VPN certificate expires in 30 days.”

Confidence: 78.5%

Three months earlier, when it was correct:

Confidence: 73.2%

The system didn’t get worse. It got more confident while being wrong.

To be precise: 78.5% is the confidence on this single query; 75.8% is the average across all 10 benchmark queries at this memory size.

Why This Matters to You Right Now

If you are building any of the following:

  • A RAG system that accumulates retrieved documents over time
  • An AI copilot with a persistent memory store
  • A customer support agent that logs past interactions
  • Any LLM workflow where context grows across sessions

This failure mode is very likely already happening in your system. You probably have not measured it, because the signal that should warn you — agent confidence — is moving in the wrong direction.

The agent is not getting dumber. It is getting confidently wrong. And there is nothing in a standard retrieval pipeline that will catch this before users do.

This article shows you exactly what is happening, why, and how to fix it. No API key required. No model downloads. All results reproduced in under 10 seconds on CPU.

The Surprise (Read This Before the Code)

Here is the counterintuitive finding, stated plainly before any proof:

As memory grows from 10 to 500 entries, agent accuracy drops from 50% to 30%. Over the same range, agent confidence rises from 70.4% to 78.0%.

The agent becomes more confident as it becomes less accurate. These two signals move in opposite directions. Any monitoring system that alerts on low confidence will never fire. The failure is invisible by design.

As memory grows, accuracy drops while confidence rises, exposing a hidden failure in RAG systems driven by similarity-based retrieval. Image by Author

This is not a quirk of the simulation. It follows from the way retrieval confidence is computed in virtually every production RAG system: as a function of mean similarity score across retrieved entries [4]. As the memory pool grows, more entries achieve moderate similarity to any given query — not because they are relevant, but because large diverse corpora guarantee near-matches. Mean similarity drifts upward. Confidence follows. Accuracy does not.
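The drift is easy to reproduce with nothing but random vectors. The sketch below is a hypothetical illustration, not the companion code: it draws random unit embeddings, scores "confidence" as the mean top-5 cosine similarity, and shows it climbing as the pool grows.

```python
import numpy as np

DIM = 64

def unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def mean_topk_similarity(pool_size: int, k: int = 5, seed: int = 1) -> float:
    """'Confidence' modeled as mean similarity of the top-k retrieved entries."""
    rng = np.random.default_rng(seed)
    query = unit(rng.normal(size=DIM))
    # Random unit vectors stand in for memory-entry embeddings.
    pool = np.stack([unit(rng.normal(size=DIM)) for _ in range(pool_size)])
    sims = pool @ query  # cosine similarity, since all vectors are unit-length
    return float(np.sort(sims)[-k:].mean())

for n in (10, 100, 1000):
    print(n, round(mean_topk_similarity(n), 3))
```

Nothing relevant is added as the pool grows — only more candidates — yet mean top-k similarity rises, because larger pools guarantee more near-matches by chance.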

Now let us prove it.

Complete Code: https://github.com/Emmimal/memory-leak-rag/

The Setup: A Support Agent With a Growing Memory Problem

The simulation models a customer support and API-documentation agent. Ten realistic queries cover payment fraud detection, authentication flows, API rate limiting, refund policies, and shipping. A memory pool grows from 10 to 500 entries.

The memory pool mixes two kinds of entries:

Relevant entries — the correct answers, stored early. Things like:

  • payment fraud threshold is $500 for review
  • POST /auth/reset resets user password via email
  • rate limit exceeded returns 429 error code

Stale entries — organizational noise that accumulates over time. Things like:

  • quarterly board meeting notes reviewed budget
  • VPN certificate expires in 30 days notify users
  • catering order placed for all-hands meeting Friday

As memory size grows, the ratio of stale entries increases. The relevant entries stay put. The noise multiplies around them.

Embeddings are deterministic and keyword-seeded — no external model or API needed. Every result here is reproducible by running one Python file.

The companion code requires only numpy, scipy, and colorama. Link in the References section.
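"Keyword-seeded" can be approximated like this. The function below is a hypothetical stand-in, not the companion implementation: each token deterministically seeds a pseudo-random direction (via an md5 hash, an assumption of this sketch), and an entry's embedding is the normalized sum, so shared tokens produce geometric closeness without any model.

```python
import hashlib
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Deterministic bag-of-tokens embedding: one fixed random direction per token."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        # Hash the token to a seed so the same token always maps to the same direction.
        seed = int(hashlib.md5(token.encode()).hexdigest()[:8], 16)
        vec += np.random.default_rng(seed).normal(size=DIM)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

q = embed("reset user account password")
a = embed("POST /auth/reset resets user password via email")
b = embed("catering order placed for all-hands meeting")
print(round(float(q @ a), 3), round(float(q @ b), 3))  # shared tokens score higher
```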

Four-phase diagram of RAG memory failure and recovery, showing relevance decay (44% → 14%), confidence increasing while accuracy drops, stale entries dominating retrieval due to small similarity gaps, and managed memory restoring accuracy with fewer entries.
RAG systems fail silently as confidence rises and relevance decays, until managed memory restores accuracy. Image by Author

Phase 1 — Relevance Collapses Silently

Start with the most basic question: of the five entries retrieved for each query, how many are actually relevant to what was asked?

Memory Size    Relevance Rate    Accuracy
10 entries     44%               50%
25 entries     34%               50%
50 entries     34%               50%
100 entries    24%               50%
200 entries    22%               40%
500 entries    14%               30%

At 10 entries, fewer than half the retrieved results are relevant — but that is enough to get the right answer most of the time when the correct entry ranks first. At 500 entries, 86% of retrieved context is noise: on average, fewer than one of the five retrieved entries is relevant. The agent is essentially building its answer around a single relevant entry — when one survives at all — buried under irrelevant ones.

The agent does not pause. It does not flag uncertainty. It keeps returning answers at the same speed with the same tone of authority.

This is the first failure mode: relevance decays silently.

Why cosine similarity cannot save you here

The intuition behind retrieval is sound: store entries as vectors, find the entries geometrically closest to the query vector, return those. The problem is that geometric closeness is not the same as relevance [2].

“VPN certificate expires in 30 days” sits close in embedding space to “session token expires after 24 hours.” “Annual performance review” sits close to “fraud review threshold.” “Parking validation updated” shares structure with “policy updated last quarter.”

These stale entries are not random noise. They are plausible noise — contextually adjacent to real queries in ways that cosine similarity cannot distinguish. As more of them accumulate, they collectively crowd the top-k retrieval slots away from the entries that actually matter. This is the core problem with dense-only retrieval at scale [5].

Phase 2 — Confidence Rises as Accuracy Falls

Now overlay confidence on the accuracy chart. This is where the problem becomes genuinely dangerous.

Memory Size    Accuracy    Avg Confidence
10 entries     50%         70.4%
25 entries     50%         71.7%
50 entries     50%         72.9%
100 entries    50%         74.7%
200 entries    40%         75.8%
500 entries    30%         78.0%

Accuracy drops 20 percentage points. Confidence rises 7.6 percentage points. They are inversely correlated across the entire range.

Think about what this means in production. Your monitoring dashboard shows confidence trending upward. Your on-call engineer sees no alert. Your users are receiving increasingly wrong answers with increasingly authoritative delivery.

Standard confidence measures retrieval coherence, not correctness. It’s just the mean similarity across retrieved entries. The more entries in the pool, the higher the probability that several of them achieve moderate similarity to any query, regardless of relevance. Mean similarity rises. Confidence follows. Accuracy does not get the memo.

This is the second failure mode: confidence is not a reliability signal. It is an optimism signal.

It tells you something matched — not that it was correct.

Phase 3 — One Stale Entry, One Wrong Answer, Zero Warning

Here is the failure made concrete. A specific query. A specific wrong answer. The exact similarity scores that caused it.

Query: “How do I reset a user account password?”
Correct answer: “Use POST /auth/reset with the user email.”

At 10 memory entries — working correctly:

[1] ✓ sim=0.457  turn=  2  POST /auth/reset resets user password via email
[2] ✓ sim=0.353  turn=  9  account locks after 5 failed login attempts
[3] ✓ sim=0.241  turn=  4  refund processed within 5 business days policy

Answer:  POST /auth/reset resets user password via email
Correct: True  |  Confidence: 73.2%

At 200 memory entries — silently broken:

[1] ✗ sim=0.471  turn=158  VPN certificate expires in 30 days notify users
[2] ✓ sim=0.457  turn=  2  POST /auth/reset resets user password via email
[3] ✓ sim=0.353  turn=  9  account locks after 5 failed login attempts

Answer:  VPN certificate expires in 30 days notify users
Correct: False  |  Confidence: 78.5%

The VPN certificate entry wins by a similarity margin of 0.014. The correct entry is still retrieved but is pushed to rank-2 by this narrow gap — enough to flip the final decision. That is the entire difference between a correct answer and a wrong one.

Why does a VPN entry beat a password reset entry for a password reset query? Because “VPN certificate expires… notify users” shares the token “users” with the query and sits structurally close to “expires” / “reset” in this embedding space. The stale entry wins on token co-occurrence, not semantic relevance. Cosine similarity cannot see the difference. This is a well-documented failure mode of dense retrieval in long-context settings [3].

This is the third failure mode: stale entries win on raw similarity, and the margin is too small to detect.

Phase 4 — The Fix: Managed Memory Architecture

Flow diagram of a managed memory retrieval pipeline in a RAG system, showing stages: incoming query → topic routing (cluster filtering) → semantic deduplication (cosine similarity > 0.85) → relevance eviction with recency bonus → lexical reranking (BM25), ending with correct answer returned (similarity 0.608).
Structured memory pipeline improves retrieval precision with filtering, deduplication, and reranking layers in RAG systems. Image by Author

The solution is not a better embedding model. It is not GPT-4 instead of GPT-3.5. It is four architectural mechanisms applied before and during retrieval. Together they break the assumption that cosine similarity equals relevance.

Input Fed In    Entries Retained    Relevance Rate    Accuracy
10              10                  46%               70%
25              25                  44%               80%
50              50                  44%               60%
100             50                  42%               60%
200             50                  42%               60%
500             50                  42%               60%

Feed in 50 entries or 500 — accuracy converges to ~60% after 50+ entries. At smaller input sizes the managed agent actually performs even better: 70% at 10 entries, 80% at 25 entries. The managed agent retains 50 entries from a 500-entry input and outperforms the agent sitting on all 500. Less context, correctly chosen, answers better.

Here is what makes that possible.

Mechanism 1 — Route the Query Before You Score It

Before any similarity computation, classify the query into a topic cluster. Each cluster has a centroid embedding computed from representative entries [5]. The query is matched to the nearest centroid, and only entries from that cluster enter the candidate set.

def _route_query_to_topic(query_emb: np.ndarray) -> str:
    """Return the topic whose cluster centroid is most similar to the query."""
    best_topic = "payment_fraud"   # default cluster
    best_sim   = -1.0
    for topic, centroid in _TOPIC_CLUSTERS.items():
        sim = _cosine_sim(query_emb, centroid)
        if sim > best_sim:
            best_sim   = sim
            best_topic = topic
    return best_topic

The password reset query routes to the auth cluster. The VPN certificate entry belongs to off_topic. It never enters the candidate set. The problem in Phase 3 disappears before similarity scoring even begins.

This one mechanism eliminates cross-topic contamination entirely. It is also cheap — centroid comparison costs O(n_clusters), not O(n_memory).
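The centroids themselves come from a one-time offline step. Here is a self-contained sketch — the topic labels and toy 2-D embeddings are illustrative assumptions, not the companion code's values:

```python
import numpy as np

def topic_centroids(labeled: dict[str, list[np.ndarray]]) -> dict[str, np.ndarray]:
    """One-time offline step: mean embedding per hand-labeled topic, normalized."""
    centroids = {}
    for topic, vecs in labeled.items():
        c = np.mean(vecs, axis=0)
        centroids[topic] = c / np.linalg.norm(c)
    return centroids

# Toy 2-D embeddings: two auth-flavored entries, one off-topic entry.
cents = topic_centroids({
    "auth":      [np.array([0.9, 0.1]), np.array([1.0, 0.0])],
    "off_topic": [np.array([0.0, 1.0])],
})
print(cents["auth"], cents["off_topic"])
```

Because centroids are unit vectors, routing reduces to a handful of dot products per query.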

Mechanism 2 — Collapse Near-Duplicates at Ingestion

Before entries are stored, near-duplicates are merged. If two entries have cosine similarity above 0.85, only the more recent one is kept.

def _deduplicate(self, entries: list[MemoryEntry]) -> list[MemoryEntry]:
    entries_sorted = sorted(entries, key=lambda e: e.turn)
    kept: list[MemoryEntry] = []
    for candidate in entries_sorted:
        is_dup = False
        for i, existing in enumerate(kept):
            if _cosine_sim(candidate.embedding, existing.embedding) > self.DEDUP_THRESHOLD:
                kept[i] = candidate   # replace older with newer
                is_dup = True
                break
        if not is_dup:
            kept.append(candidate)
    return kept

Without deduplication, the same stale content stored ten times across ten turns accumulates collective retrieval weight. Ten similar VPN-certificate entries push the off-topic cluster centroid toward auth space. Deduplication collapses them to one. The correct cluster boundaries survive.
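The centroid-drift claim is easy to verify in two dimensions. In this toy setup (hypothetical vectors, not the companion code), an auth-adjacent stale entry duplicated ten times drags the off-topic centroid toward the auth query direction:

```python
import numpy as np

def unit(v) -> np.ndarray:
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

auth_query = unit([1.0, 0.0])    # direction of auth-related queries
vpn_note   = unit([0.6, 0.8])    # stale entry that is auth-adjacent
catering   = unit([-1.0, 0.0])   # clearly off-topic entry

# Off-topic centroid with the VPN note deduplicated vs. stored ten times.
centroid_dedup = unit(np.mean([vpn_note, catering], axis=0))
centroid_dup   = unit(np.mean([vpn_note] * 10 + [catering], axis=0))

print(float(auth_query @ centroid_dedup), float(auth_query @ centroid_dup))
```

With duplicates collapsed, the off-topic centroid stays far from auth space and routing keeps working.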

Mechanism 3 — Evict by Relevance, Not by Age

When the retained pool must be capped, entries are scored by their maximum cosine similarity to any known topic cluster centroid. Entries that match no known query topic are evicted first. Within the retained set, a recency bonus (+0.0 to +0.12) breaks ties in favor of more recent entries.

def _topic_relevance_score(self, entry: MemoryEntry) -> float:
    return max(
        _cosine_sim(entry.embedding, centroid)
        for centroid in _TOPIC_CLUSTERS.values()
    )

This is the critical architectural inversion. Most implementations use a queue: oldest entries out, newest entries in. That is exactly backwards when the correct answers were stored at system initialization and the noise arrived later. A relevance-scored eviction policy keeps the answer to “what is the fraud threshold” — stored at turn 1 — over a catering order stored at turn 190. Recency is a tiebreaker, not the primary criterion.
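Putting the pieces together, a minimal self-contained eviction sketch might look like this — the Entry shape and the +0.12 recency weight follow the description above, but the companion code's version differs in detail:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Entry:
    turn: int
    embedding: np.ndarray

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evict_to_cap(entries: list[Entry], centroids: list[np.ndarray], cap: int = 50) -> list[Entry]:
    """Keep the cap entries best aligned with any known topic centroid.
    Recency adds at most +0.12, so it only breaks near-ties."""
    max_turn = max(e.turn for e in entries)
    def score(e: Entry) -> float:
        relevance = max(cosine(e.embedding, c) for c in centroids)
        return relevance + 0.12 * (e.turn / max_turn)
    return sorted(entries, key=score, reverse=True)[:cap]

# An old correct answer aligned with a centroid beats a recent off-topic note.
auth_centroid = np.array([1.0, 0.0])
old_answer = Entry(turn=1,   embedding=np.array([0.9, 0.1]))
new_noise  = Entry(turn=190, embedding=np.array([0.0, 1.0]))
kept = evict_to_cap([old_answer, new_noise], [auth_centroid], cap=1)
print(kept[0].turn)  # 1 — the old relevant entry survives
```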

Mechanism 4 — Separate Same-Topic Entries with Lexical Overlap

Topic routing and recency weighting still cannot separate two entries that belong to the same cluster but answer different questions. Both of these survive topic filtering for the fraud threshold query:

  • payment fraud threshold is $500 for review — correct ✓
  • Visa Mastercard Amex card payment accepted — wrong, but also payment_fraud

Cosine similarity gives them similar scores. A BM25-inspired [1] lexical overlap bonus resolves this by rewarding entries whose content shares meaningful non-stop-word tokens with the query.

@staticmethod
def _lexical_overlap_bonus(query_text: str, entry: MemoryEntry) -> float:
    q_tokens = {
        w.strip("?.,!").lower()
        for w in query_text.split()
        if len(w.strip("?.,!")) > 3 and w.lower() not in _LEX_STOP
    }
    e_tokens = set(entry.content.lower().replace("/", " ").split())
    overlap  = len(q_tokens & e_tokens)
    return min(overlap * 0.05, 0.15)

The fraud threshold query contains “threshold.” The correct entry contains “threshold.” The wrong entry does not. A bonus of 0.05 tips the ranking. Multiply this effect across all ten queries and accuracy lifts measurably. This is the pattern known as hybrid retrieval [2] — dense embedding similarity combined with sparse lexical matching — implemented here as a lightweight reranking step that requires no second embedding pass.
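A self-contained rerank sketch combining the two signals — the dense cosine scores below are made-up placeholders for illustration, and the stop-word list is a small assumption:

```python
_STOP = {"what", "how", "is", "the", "for", "does"}

def lexical_bonus(query: str, content: str) -> float:
    """BM25-inspired overlap bonus: +0.05 per shared meaningful token, capped at 0.15."""
    q = {w.strip("?.,!").lower() for w in query.split()
         if len(w.strip("?.,!")) > 3 and w.lower() not in _STOP}
    e = set(content.lower().replace("/", " ").split())
    return min(len(q & e) * 0.05, 0.15)

query = "What is the payment fraud threshold for review?"
# (content, dense cosine score) — dense scores are illustrative placeholders.
candidates = [
    ("Visa Mastercard Amex card payment accepted", 0.52),
    ("payment fraud threshold is $500 for review", 0.51),
]
reranked = sorted(candidates,
                  key=lambda c: c[1] + lexical_bonus(query, c[0]),
                  reverse=True)
print(reranked[0][0])  # the $500 threshold entry wins despite the lower dense score
```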

All four mechanisms are load-bearing. Remove any one and accuracy degrades:

  • No routing → cross-topic stale entries re-enter competition
  • No deduplication → repeated stale content shifts cluster centroids
  • No relevance eviction → FIFO discards the oldest correct answers first
  • No lexical reranking → same-topic wrong entries win on coin-flip

The Final Score

Metric              Unbounded (200 entries)    Managed (50 retained)
Relevance rate      22%                        42%
Accuracy            40%                        60%
Avg confidence      75.8%                      77.5%
Memory footprint    200 entries                50 entries

Side-by-side comparison of unbounded vs managed RAG memory, showing 200-entry memory with 78% stale/off-topic data and 40% accuracy, versus 50-entry managed memory with higher relevance distribution and 60% accuracy after eviction and filtering.
Memory control improves retrieval relevance and accuracy, preventing stale entries from dominating results in RAG systems. Image by Author

The same query that returned a VPN certificate answer under unbounded memory now correctly returns the auth reset entry — similarity 0.608 versus the stale entry’s 0.471. Topic routing excluded the stale entry before it could compete. The correct answer wins by a comfortable margin instead of losing by a razor-thin one.

One-quarter of the memory. Twenty percentage points more accurate. The constraint is the feature.

What To Change in Your System (Starting Monday)

1. Stop using confidence as a correctness proxy. Instrument your agent with ground-truth evaluation — a small fixed set of known queries with verified answers — sampled on a schedule. Confidence tells you retrieval happened. It does not tell you retrieval worked.

2. Audit your eviction policy. If you are using FIFO or LRU eviction, you are discarding your oldest entries first. In most knowledge-base agents, those are your most valuable entries. Switch to relevance-scored eviction with recency as a tiebreaker.

3. Add a routing step before similarity scoring. Even a simple centroid-based cluster assignment dramatically reduces cross-topic contamination. This does not require retraining. It requires computing a centroid per topic cluster — a one-time offline step — and filtering candidates before scoring.

4. Run deduplication at ingestion. Repeated near-identical entries multiply their collective retrieval weight. Collapse them to the most recent version at write time, not at read time.

5. Add a lexical overlap bonus as a reranking step. If two entries score similarly on cosine similarity, a BM25-style token overlap bonus [1] will usually separate the one that actually shares vocabulary with the query from the one that merely shares topic. This is cheap to implement and does not require a second embedding pass.
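For recommendation 1, a minimal probe harness can be this small. The query/keyword pairs here are hypothetical examples in the spirit of this article's queries, not a standard benchmark:

```python
# Hypothetical probe set: fixed queries with verified answer keywords.
PROBES = [
    ("How do I reset a user account password?", "/auth/reset"),
    ("What is the payment fraud threshold?", "$500"),
]

def ground_truth_accuracy(answer_fn) -> float:
    """Fraction of probe queries whose answer contains its verified keyword."""
    hits = sum(1 for query, keyword in PROBES if keyword in answer_fn(query))
    return hits / len(PROBES)

def stub_agent(query: str) -> str:
    # Stub agent: only knows the auth answer.
    if "password" in query:
        return "Use POST /auth/reset with the user email."
    return "no relevant memory found"

print(ground_truth_accuracy(stub_agent))  # 0.5
```

Run it on a schedule and alert on ground-truth accuracy, not on confidence.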

Limitations

This simulation uses deterministic keyword-seeded embeddings, not a learned sentence encoder. Topic clusters are hand-labeled. The confidence model is a linear function of mean retrieval score. Real systems have higher-dimensional embedding spaces, learned boundaries, and calibrated probabilities that may behave differently at the margins.

These simplifications make the failure modes easier to observe, not harder. The structural causes — cosine similarity measuring coherence not correctness, FIFO eviction discarding relevant old entries, stale entries accumulating collective weight — persist regardless of embedding dimension or model scale [3]. The mechanisms described address those structural causes.

The accuracy numbers are relative comparisons within a controlled simulation, not benchmarks to generalize. The important quantities are the directions and magnitudes of change as memory scales.

Running the Code Yourself

pip install numpy scipy colorama

# Run the full four-phase demo
python llm_memory_leak_demo.py

# Suppress INFO logs
python llm_memory_leak_demo.py --quiet

# Run unit tests first (recommended — verifies correctness logic)
python llm_memory_leak_demo.py --test

Run --test before capturing output for replication. The TestAnswerKeywords suite verifies that each query’s correctness filter matches exactly one template entry — this is what closes the topic-level correctness loophole described in Phase 3.

Key Takeaways

  1. Relevance collapses silently. At 10 entries, 44% of retrieved context is relevant. At 500 entries, 14% is. The agent keeps answering throughout.
  2. Confidence is an optimism signal, not a reliability signal. It rises as accuracy falls. Your alert will never fire.
  3. Stale entries win on margins you cannot see. A 0.014 cosine similarity gap is the difference between a correct answer and a VPN certificate.
  4. Four mechanisms are required — not three. Topic routing, semantic deduplication, relevance-scored eviction, and lexical reranking each close a failure mode the others cannot.
  5. Bounded memory beats unbounded memory. 50 well-chosen entries answer better than 200 accumulated ones. Less context, correctly chosen, is strictly better.

Final Thought

More memory doesn’t make LLM systems smarter.

It makes them more confident in whatever they retrieve.

If retrieval degrades, confidence becomes the most dangerous metric you have.

Disclosure

This article was written by the author. The companion code is original work. All experimental results are produced by running the published code; no results were manually adjusted. The author has no financial relationship with any tool, library, or company mentioned in this article.

References

[1] Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 4(1–2), 1–174. https://doi.org/10.1561/1500000019

[2] Luan, Y., Eisenstein, J., Toutanova, K., & Collins, M. (2021). Sparse, Dense, and Attentional Representations for Text Retrieval. Transactions of the Association for Computational Linguistics, 9, 329–345. https://doi.org/10.1162/tacl_a_00369

[3] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638

[4] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://arxiv.org/abs/2005.11401

[5] Gao, L., Ma, X., Lin, J., & Callan, J. (2023). Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1762–1777. https://doi.org/10.18653/v1/2023.acl-long.99 (arXiv:2212.10496)

The companion code for this article is available at: https://github.com/Emmimal/memory-leak-rag/

All terminal output shown in this article was produced by running python llm_memory_leak_demo.py on the published code with no modifications.
