Your RAG System Retrieves the Right Data — But Still Produces Wrong Answers. Here’s Why (and How to Fix It).



The Pipeline Worked Exactly as Designed. The Answer Was Still Wrong.

I want to tell you about the moment I stopped trusting retrieval scores.

I was running a query against a knowledge base I had built carefully. Good chunking. Hybrid search. Reranking. The top-k documents came back with cosine similarities as high as 0.86. Every indicator said the pipeline was working. I passed those documents to a QA model, got a confident answer, and moved on.

The answer was wrong.

Not hallucinated-wrong. Not retrieval-failed-wrong. The right documents had come back. Both of them. A preliminary earnings figure and the audited revision that superseded it, sitting side by side in the same context window. The model read both, chose one, and reported it with 80% confidence. It had no mechanism to tell me it had been asked to referee a dispute it was never designed to judge.

That is the failure mode this article is about. It does not show up in your retrieval metrics. It does not trigger your hallucination detectors. It lives in the gap between context assembly and generation — the one step in the RAG pipeline that almost nobody evaluates.

I built a reproducible experiment to isolate it. Everything in this article runs on a CPU in about 220 MB. No API key. No cloud. No GPU. The output you see in the terminal screenshots is unmodified.

Complete Source Code: https://github.com/Emmimal/rag-conflict-demo


What the Experiment Tests

The setup is deliberately clinical. Three questions. One knowledge base containing three document pairs, each pair making directly contradictory claims about the same fact. Retrieval is tuned to return both sides of each conflict every time.

The question is not whether retrieval works. It does. The question is: what does the model do when you hand it a contradictory brief and ask it to answer with confidence?

The answer, as you will see, is that it picks a side. Silently. Confidently. Without telling you it had a choice to make.

RAG systems can retrieve the right documents but still produce incorrect answers due to hidden conflicts during context assembly. Image by Author.

Three Scenarios, Each Drawn from Production

Scenario A — The restatement nobody told the model about

A company’s Q4 earnings release reports annual revenue of $4.2M for fiscal year 2023. Three months later, external auditors restate that figure to $6.8M. Both documents live in the knowledge base. Both are indexed. When someone asks “What was Acme Corp’s revenue for fiscal year 2023?” — both come back, with similarity scores of 0.863 and 0.820 respectively.

The model answers $4.2M.

It chose the preliminary figure over the audited revision because the preliminary document scored marginally higher in retrieval. Nothing about the answer signals that a more authoritative source disagreed.

Scenario B — The policy update that arrived too late

A June 2023 HR policy mandates three days per week in-office. A November 2023 revision explicitly reverses it — fully remote is now permitted. Both documents are retrieved (similarity scores 0.806 and 0.776) when an employee asks about the current remote work policy.

The model answers with the June policy. The stricter, older rule. The one that no longer applies.

Scenario C — The API docs that never got deprecated

Version 1.2 of an API reference states a rate limit of 100 requests per minute. Version 2.0, published after an infrastructure upgrade, raises it to 500. Both are retrieved (scores 0.788 and 0.732).

The model answers 100. A developer using this answer to configure their rate limiter will throttle themselves to one-fifth of their actual allowance.

None of these are edge cases. Every production knowledge base accumulates exactly these patterns over time: financial restatements, policy revisions, versioned documentation. The pipeline has no layer that detects or handles them.


Running the Experiment

pip install -r requirements.txt
python rag_conflict_demo.py

requirements.txt

sentence-transformers>=2.7.0   # all-MiniLM-L6-v2  (~90 MB)
transformers>=4.40.0           # deepset/minilm-uncased-squad2 (~130 MB)
torch>=2.0.0                   # CPU-only is fine
numpy>=1.24.0
colorama>=0.4.6

Two models. One for embeddings, one for extractive QA. Both download automatically on first run and cache locally. Total: ~220 MB. No authentication required.


Phase 1: What Naive RAG Does

Here is the unmodified terminal output from Phase 1 — standard RAG with no conflict handling:

────────────────────────────────────────────────────────────────────
  NAIVE  |  Scenario A — Numerical Conflict
────────────────────────────────────────────────────────────────────
  Query       : What was Acme Corp's annual revenue for fiscal year 2023?
  Answer      : $4.2M
  Confidence  : 80.3%
  Conflict    : YES — see warning

  Sources retrieved
    [0.863] Q4-2023-Earnings-Release            (2024-01-15)
    [0.820] 2023-Annual-Report-Revised          (2024-04-03)
    [0.589] Company-Overview-2024               (2024-01-01)

  Conflict pairs
    fin-001  ↔  fin-002
    numerical contradiction  (topic_sim=0.83)
    [Q4-2023-Earnings-Release: {'$4.2M'}]  vs  [2023-Annual-Report-Revised: {'$6.8M'}]
────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────
  NAIVE  |  Scenario B — Policy Conflict
────────────────────────────────────────────────────────────────────
  Query       : What is the current remote work policy for employees?
  Answer      : all employees are required to be present in the office
                a minimum of 3 days per week
  Confidence  : 78.3%
  Conflict    : YES — see warning

  Sources retrieved
    [0.806] HR-Policy-June-2023                 (2023-06-01)
    [0.776] HR-Policy-November-2023             (2023-11-15)
    [0.196] HR-Policy-November-2023             (2023-11-15)
────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────
  NAIVE  |  Scenario C — Technical Conflict
────────────────────────────────────────────────────────────────────
  Query       : What is the API rate limit for the standard tier?
  Answer      : 100 requests per minute
  Confidence  : 81.0%
  Conflict    : YES — see warning

  Sources retrieved
    [0.788] API-Reference-v1.2                  (2023-02-10)
    [0.732] API-Reference-v2.0                  (2023-09-20)
    [0.383] API-Reference-v2.0                  (2023-09-20)
────────────────────────────────────────────────────────────────────
A dark-themed terminal window showing Phase 1 output from rag_conflict_demo.py. All three scenarios return wrong or outdated answers with confidence scores between 78% and 81%. Each scenario shows the conflict pair that was detected but not resolved.
Retrieval succeeded every time. The QA model still answered from whichever conflicting document it attended to most — silently and confidently. Image by Author.

Three questions. Three wrong answers. Confidence between 78% and 81% on every one of them.

Notice what is happening in the logs before each response:

09:02:20 | WARNING  | Conflict detected: {('fin-001', 'fin-002'): "numerical contradiction..."}
09:02:24 | WARNING  | Conflict detected: {('hr-001', 'hr-002'): "contradiction signal asymmetry..."}
09:02:25 | WARNING  | Conflict detected: {('api-001', 'api-002'): "contradiction signal asymmetry..."}

The conflicts are detected. They are logged. And then, because resolve_conflicts=False, the pipeline passes the full contradictory context to the model and answers anyway. That warning is going nowhere. In a production system without a conflict detection layer, you would not even get the warning.


Why the Model Behaves This Way

This requires a moment of explanation, because the model is not broken. It is doing exactly what it was trained to do.

deepset/minilm-uncased-squad2 is an extractive QA model. It reads a context string and selects the span with the highest combined start-logit and end-logit score. It has no output class for “I see two contradictory claims.” When the context contains both $4.2M and $6.8M, the model computes token-level scores across the entire string and selects whichever span wins.
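The span-selection mechanics can be made concrete with a toy sketch. The logits below are invented for illustration, not taken from the real model; the point is only the selection rule — the answer is the (start, end) pair maximising the combined logit score, with end ≥ start:

```python
import numpy as np

# Toy context tokens spanning both conflicting figures.
tokens = ["revenue", "of", "$4.2M", "...", "restated", "to", "$6.8M"]

# Invented logits: the earlier, declarative span scores marginally higher.
start_logits = np.array([0.1, 0.0, 2.1, -1.0, 0.2, 0.0, 1.9])
end_logits   = np.array([0.0, 0.1, 2.0, -1.0, 0.1, 0.0, 1.8])

# Score every (start, end) pair; mask out spans where end < start.
n = len(tokens)
mask = np.triu(np.ones((n, n), dtype=bool))
scores = np.where(mask, start_logits[:, None] + end_logits[None, :], -np.inf)

i, j = np.unravel_index(np.argmax(scores), scores.shape)
print(tokens[i : j + 1])  # → ['$4.2M'] — the earlier span wins
```

Nothing in this computation can see timestamps or authority. A 0.2-logit edge, driven by position and phrasing, decides which figure the user receives.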

That selection is driven by factors that have nothing to do with correctness [8]. The two primary drivers are:

Position bias. Earlier spans in the context receive marginally higher attention scores due to the encoder architecture. The preliminary document ranked higher in retrieval and therefore appeared first.

Language strength. Direct declarative statements (“revenue of $4.2M”) outscore hedged or conditional phrasing (“following restatement… is $6.8M”).

A third contributing factor is lexical alignment — spans whose vocabulary overlaps more closely with the question tokens score higher regardless of whether the underlying claim is current or authoritative.

Critically, what the model does not consider at all: source date, document authority, audit status, or whether one claim supersedes another. These signals are simply invisible to the extractive model.

A diagram showing the three retrieved documents concatenated into a context string. The QA model assigns a higher confidence score to the $4.2M span from the first document because it appears earlier and uses direct declarative language, even though the $6.8M figure from the second document is more recent and authoritative.
The model has no mechanism to weigh source date or audit authority. It picks the span with the highest confidence score — and position wins. Image by Author.

The same dynamic plays out in generative LLMs, but less visibly: the model paraphrases rather than extracting verbatim spans, so the wrong answer is dressed in fluent prose. The mechanism is the same. Joren et al. (2025), in work presented at ICLR 2025, demonstrate that frontier models including Gemini 1.5 Pro, GPT-4o, and Claude 3.5 frequently produce incorrect answers rather than abstaining when retrieved context is insufficient to answer the query, and that this failure is not reflected in the model's expressed confidence.

The failure is not a model deficiency. It is an architectural gap: the pipeline has no stage that detects contradictions before handing context to generation.


Building the Conflict Detection Layer

Diagram of a five-component RAG system architecture showing Document, KnowledgeBase, ConflictDetector, RAGPipeline, and RAGResponse with data flow and internal processing steps.
A modular RAG pipeline architecture showing document ingestion, embedding-based retrieval, conflict detection, QA processing, and structured response generation. Image by Author.

The detector sits between retrieval and generation. It examines every pair of retrieved documents and flags contradictions before the QA model sees the context. Crucially, embeddings for all retrieved documents are computed in a single batched forward pass before pair comparison begins — each document is encoded exactly once, regardless of how many pairs it participates in.

Two heuristics do the work.


Heuristic 1: Numerical Contradiction

Two topic-similar documents that contain non-overlapping meaningful numbers are flagged. The implementation filters out years (1900–2099) and bare small integers (1–9), which appear ubiquitously in enterprise text and would generate constant false positives if treated as claim values.

@classmethod
def _extract_meaningful_numbers(cls, text: str) -> set[str]:
    results = set()
    for m in cls._NUM_RE.finditer(text):
        raw = m.group().strip()
        numeric_core = re.sub(r"[$€£MBK%,]", "", raw, flags=re.IGNORECASE).strip()
        try:
            val = float(numeric_core)
        except ValueError:
            continue
        if 1900 <= val <= 2099 and "." not in numeric_core:
            continue   # skip years
        if val < 10 and re.fullmatch(r"\d+", raw):
            continue   # skip bare small integers
        results.add(raw)
    return results

Applied to Scenario A: fin-001 yields {'$4.2M'}, fin-002 yields {'$6.8M'}. Empty intersection — conflict detected.
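The heuristic can be exercised standalone. Here is a self-contained sketch — the `_NUM_RE` pattern below is an assumption for illustration, not the repo's exact regex:

```python
import re

# Assumed pattern: optional currency symbol, digits, optional decimal,
# optional magnitude/percent suffix. Not the repo's exact _NUM_RE.
_NUM_RE = re.compile(r"[$€£]?\d[\d,]*(?:\.\d+)?[MBK%]?", re.IGNORECASE)

def extract_meaningful_numbers(text: str) -> set:
    results = set()
    for m in _NUM_RE.finditer(text):
        raw = m.group().strip()
        numeric_core = re.sub(r"[$€£MBK%,]", "", raw, flags=re.IGNORECASE).strip()
        try:
            val = float(numeric_core)
        except ValueError:
            continue
        if 1900 <= val <= 2099 and "." not in numeric_core:
            continue  # years are dates, not claim values
        if val < 10 and re.fullmatch(r"\d+", raw):
            continue  # bare small integers appear everywhere
        results.add(raw)
    return results

a = extract_meaningful_numbers("Q4 2023 earnings: revenue of $4.2M.")
b = extract_meaningful_numbers("Restated 2023 revenue is $6.8M.")
print(a, b, a & b)  # → {'$4.2M'} {'$6.8M'} set() — disjoint, so conflict
```

Note that "2023" and the bare "4" in "Q4" are filtered out; only the dollar figures survive as claim values.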


Heuristic 2: Contradiction Signal Asymmetry

Two documents discussing the same topic, where one contains contradiction tokens the other does not, are flagged. The token set splits into two groups kept as separate frozenset objects:

  • _NEGATION_TOKENS: “not”, “never”, “no”, “cannot”, “doesn’t”, “isn’t”, and related forms
  • _DIRECTIONAL_TOKENS: “increased”, “decreased”, “reduced”, “eliminated”, “removed”, “discontinued”

These are unioned into CONTRADICTION_SIGNALS. Keeping them separate makes domain-specific tuning straightforward — a legal corpus might need a broader negation set; a changelog corpus might need more directional tokens.

Applied to Scenario B: hr-002 contains “no” (from “no longer required”); hr-001 does not. Asymmetry detected. Applied to Scenario C: api-002 contains “increased”; api-001 does not. Asymmetry detected.
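The asymmetry check itself is a symmetric set difference over signal tokens. A minimal sketch, with the token sets abbreviated from the lists above:

```python
# Abbreviated token sets; the full lists live in the ConflictDetector class.
_NEGATION_TOKENS = frozenset({"not", "never", "no", "cannot"})
_DIRECTIONAL_TOKENS = frozenset({"increased", "decreased", "reduced",
                                 "eliminated", "removed", "discontinued"})
CONTRADICTION_SIGNALS = _NEGATION_TOKENS | _DIRECTIONAL_TOKENS

def signal_asymmetry(doc_a: str, doc_b: str) -> set:
    """Return signal tokens present in exactly one of the two documents."""
    tokens_a = set(doc_a.lower().split()) & CONTRADICTION_SIGNALS
    tokens_b = set(doc_b.lower().split()) & CONTRADICTION_SIGNALS
    return tokens_a ^ tokens_b  # symmetric difference: one-sided signals only

print(signal_asymmetry(
    "Employees must work in the office 3 days per week.",
    "Employees are no longer required to work in the office.",
))  # → {'no'}
```

If both documents contain the same signal token, it cancels out — only one-sided signals indicate that one document is contradicting the other.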

Both heuristics require topic_sim >= 0.68 before firing. This threshold gates out unrelated documents that happen to share a number or a negation word. The 0.68 value was calibrated for this document set with all-MiniLM-L6-v2 — treat it as a starting point, not a universal constant. Different embedding models and different domains will require recalibration.


The Resolution Strategy: Cluster-Aware Recency

When conflicts are detected, the pipeline resolves them by keeping the most recently timestamped document from each conflict cluster. The key design decision is cluster-aware.

A top-k result may contain multiple independent conflict clusters — two financial documents disagreeing on revenue and two API documents disagreeing on rate limits, all in the same top-3 result. A naive approach — keep only the single most recent document from the combined conflicting set — would silently discard the winning document from every cluster except the most recently published one overall.

Instead, the implementation builds a conflict graph, finds connected components via iterative DFS, and resolves each component independently:

@staticmethod
def _resolve_by_recency(
    contexts: list[RetrievedContext],
    conflict: ConflictReport,
) -> list[RetrievedContext]:
    # Build adjacency list
    adj: dict[str, set[str]] = defaultdict(set)
    for a_id, b_id in conflict.conflict_pairs:
        adj[a_id].add(b_id)
        adj[b_id].add(a_id)

    # Connected components via iterative DFS
    visited: set[str] = set()
    clusters: list[set[str]] = []
    for start in adj:
        if start not in visited:
            cluster: set[str] = set()
            stack = [start]
            while stack:
                node = stack.pop()
                if node not in visited:
                    visited.add(node)
                    cluster.add(node)
                    stack.extend(adj[node] - visited)
            clusters.append(cluster)

    all_conflicting_ids = set().union(*clusters) if clusters else set()
    non_conflicting = [c for c in contexts if c.document.doc_id not in all_conflicting_ids]

    resolved_docs = []
    for cluster in clusters:
        cluster_ctxs = [c for c in contexts if c.document.doc_id in cluster]
        # ISO-8601 timestamps sort lexicographically — max() gives most recent
        best = max(cluster_ctxs, key=lambda c: c.document.timestamp)
        resolved_docs.append(best)

    return non_conflicting + resolved_docs

Non-conflicting documents pass through unchanged. Each conflict cluster contributes exactly one winner.
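The cluster-aware behaviour is easiest to see with two independent conflicts in one result set. The sketch below uses a minimal `Doc` stand-in for the repo's document objects (the class shape is assumed for illustration):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    timestamp: str  # ISO-8601, so lexicographic max() = most recent

def resolve_by_recency(docs, conflict_pairs):
    # Build the conflict graph.
    adj = defaultdict(set)
    for a, b in conflict_pairs:
        adj[a].add(b)
        adj[b].add(a)
    # Connected components via iterative DFS.
    visited, clusters = set(), []
    for start in adj:
        if start not in visited:
            cluster, stack = set(), [start]
            while stack:
                node = stack.pop()
                if node not in visited:
                    visited.add(node)
                    cluster.add(node)
                    stack.extend(adj[node] - visited)
            clusters.append(cluster)
    conflicting = set().union(*clusters) if clusters else set()
    keep = [d for d in docs if d.doc_id not in conflicting]
    # Each cluster contributes exactly one winner: its most recent document.
    for cluster in clusters:
        keep.append(max((d for d in docs if d.doc_id in cluster),
                        key=lambda d: d.timestamp))
    return keep

docs = [Doc("fin-001", "2024-01-15"), Doc("fin-002", "2024-04-03"),
        Doc("api-001", "2023-02-10"), Doc("api-002", "2023-09-20"),
        Doc("misc-001", "2024-01-01")]
winners = resolve_by_recency(docs, [("fin-001", "fin-002"),
                                    ("api-001", "api-002")])
print(sorted(d.doc_id for d in winners))  # → ['api-002', 'fin-002', 'misc-001']
```

A naive "keep the single newest conflicting document" strategy would have returned only `fin-002` and `misc-001`, silently dropping the correct API answer.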


Phase 2: What Conflict-Aware RAG Does

────────────────────────────────────────────────────────────────────
  RESOLVED  |  Scenario A — Numerical Conflict
────────────────────────────────────────────────────────────────────
  Query       : What was Acme Corp's annual revenue for fiscal year 2023?
  Answer      : $6.8M
  Confidence  : 79.6%
  Conflict    : RESOLVED

  ⚠  Conflicting sources detected — answer derived from most recent
     document per conflict cluster.

  Sources retrieved
    [0.820] 2023-Annual-Report-Revised          (2024-04-03)
    [0.589] Company-Overview-2024               (2024-01-01)

  Conflict cluster resolved: kept '2023-Annual-Report-Revised' (2024-04-03),
  discarded 1 older doc(s).
────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────
  RESOLVED  |  Scenario B — Policy Conflict
────────────────────────────────────────────────────────────────────
  Answer      : employees are no longer required to maintain
                a fixed in-office schedule
  Confidence  : 78.0%
  Conflict    : RESOLVED

  Conflict cluster resolved: kept 'HR-Policy-November-2023' (2023-11-15),
  discarded 1 older doc(s).
────────────────────────────────────────────────────────────────────

────────────────────────────────────────────────────────────────────
  RESOLVED  |  Scenario C — Technical Conflict
────────────────────────────────────────────────────────────────────
  Answer      : 500 requests per minute
  Confidence  : 80.9%
  Conflict    : RESOLVED

  Conflict cluster resolved: kept 'API-Reference-v2.0' (2023-09-20),
  discarded 1 older doc(s).
────────────────────────────────────────────────────────────────────
Terminal-style diagram showing a conflict-aware RAG system correctly resolving numerical, policy, and technical conflicts across three scenarios and producing correct answers.
A conflict-aware RAG system resolves contradictions in retrieved documents and produces correct, up-to-date answers across financial, HR, and API queries. Image by Author.

Three questions. Three correct answers. The confidence scores are almost identical to Phase 1 — 78–81% — which underscores the original point: confidence was never the signal that something had gone wrong. It still is not. The only thing that changed is the architecture.

A three-row comparison table showing the same query answered by Naive RAG and Conflict-Aware RAG side by side. Naive RAG returns $4.2M, 3 days/week in-office, and 100 requests per minute — all wrong. Conflict-Aware RAG returns $6.8M, fully remote permitted, and 500 requests per minute — all correct.
Same retriever, same model, same query. The only difference is whether conflict detection runs before context is handed to the QA model. Image by Author.

What the Heuristics Cannot Catch

I want to be precise about the failure envelope, because a method that understates its own limitations is not useful.

Paraphrased conflicts. The heuristics catch numerical differences and explicit contradiction tokens. They will not catch “the service was retired” versus “the service is currently available.” That is a real conflict with no numeric difference and no negation token. For these, a Natural Language Inference model — cross-encoder/nli-deberta-v3-small at ~80 MB — can score entailment versus contradiction between sentence pairs. This is the more robust path described in the academic literature (Asai et al., 2023), and the ConflictDetector class is designed to be extended at the _pair_conflict_reason method for exactly this purpose.
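A minimal sketch of what that NLI extension might look like. The pipeline task and label names below are assumptions about this model's output format, and the model-loading call is kept inside the function so nothing downloads at import time:

```python
def nli_contradiction_score(premise: str, hypothesis: str) -> float:
    """Score contradiction between two sentences with a small NLI model.
    Downloads ~80 MB on first call; label name 'contradiction' is an
    assumption about the model's config."""
    from transformers import pipeline
    clf = pipeline("text-classification",
                   model="cross-encoder/nli-deberta-v3-small",
                   top_k=None)
    scores = clf({"text": premise, "text_pair": hypothesis})
    if isinstance(scores[0], list):  # some versions nest per-input
        scores = scores[0]
    return next(s["score"] for s in scores if s["label"] == "contradiction")

def is_paraphrased_conflict(score: float, threshold: float = 0.7) -> bool:
    """Gate for _pair_conflict_reason: flag only confident contradictions."""
    return score >= threshold

print(is_paraphrased_conflict(0.91))  # True
```

In the extended detector, `nli_contradiction_score` would run only on document pairs that already pass the topic-similarity gate, keeping the expensive cross-encoder calls to a minimum.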

Non-temporal conflicts. Recency-based resolution is appropriate for versioned documents and policy updates. It is not appropriate for expert opinion disagreements (the minority view may be correct), cross-methodology data conflicts (recency is irrelevant), or multi-perspective queries (where surfacing both views is the right response). In these cases, the ConflictReport data structure provides the raw material to build a different response — surfacing both claims, flagging for human review, or asking the user for clarification.

Scale. Pair comparison is O(k²) in retrieved documents. For k=3 this is trivial; for k=20 it is still fine. For pipelines retrieving k=100 or more, pre-indexing known conflict pairs or cluster-based detection becomes necessary.
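Because the embeddings are already batched, the O(k²) topic-similarity gate reduces to a single matrix product. A sketch with random stand-in vectors (real document embeddings would replace `emb`):

```python
import numpy as np

rng = np.random.default_rng(0)
k, dim = 20, 384  # all-MiniLM-L6-v2 produces 384-dim embeddings

# Stand-in for the batched document embeddings, L2-normalised.
emb = rng.normal(size=(k, dim))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

sim = emb @ emb.T  # k x k cosine-similarity matrix in one operation
i, j = np.triu_indices(k, k=1)  # upper triangle: each pair once
candidate_pairs = [(a, b) for a, b in zip(i, j) if sim[a, b] >= 0.68]
print(len(candidate_pairs))  # random vectors clear the 0.68 gate ~never
```

Only the pairs that clear the gate proceed to the (cheaper, string-based) heuristics, so the quadratic term stays inside a single vectorised operation.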


Where the Research Community Is Taking This

What you have seen here is a practical heuristic approximation of a problem that active research is attacking at a much more sophisticated level.

Cattan et al. (2025) introduced the CONFLICTS benchmark — the first specifically designed to track how models handle knowledge conflicts in realistic RAG settings. Their taxonomy identifies four conflict categories — freshness, conflicting opinions, complementary information, and misinformation — each requiring distinct model behaviour. Their experiments show that LLMs frequently fail to resolve conflicts appropriately across all categories, and that explicitly prompting models to reason about potential conflicts substantially improves response quality, though substantial room for improvement remains.

Ye et al. (2026) introduced TCR (Transparent Conflict Resolution), a plug-and-play framework that disentangles semantic relevance from factual consistency via dual contrastive encoders. Self-answerability estimation gauges confidence in the model’s parametric memory, and the resulting scalar signals are injected into the generator via lightweight soft-prompt tuning. Across seven benchmarks, TCR improves conflict detection by 5–18 F1 points while adding only 0.3% parameters.

Gao et al. (2025) introduced CLEAR (Conflict-Localized and Enhanced Attention for RAG), which probes LLM hidden states at the sentence representation level to detect where conflicting knowledge manifests internally. Their analysis reveals that knowledge integration occurs hierarchically and that conflicting versus aligned knowledge exhibits distinct distributional patterns within sentence-level representations. CLEAR uses these signals for conflict-aware fine-tuning that guides the model toward accurate evidence integration.

The consistent finding across all of this work matches what this experiment demonstrates directly: retrieval quality and answer quality are distinct dimensions, and the gap between them is larger than the community has historically acknowledged.

The difference between that research and this article is 220 MB and no authentication.


What You Should Actually Do With This

1. Add a conflict detection layer before generation. The ConflictDetector class is designed to drop into an existing pipeline at the point where you assemble your context string. Even the two simple heuristics here will catch the patterns that appear most often in enterprise corpora: restatements, policy updates, versioned documentation.

2. Distinguish conflict types before resolving. A temporal conflict (use the newer document) is a different problem from a factual dispute (flag for human review) or an opinion conflict (surface both views). A single resolution strategy applied blindly creates new failure modes.

3. Log every ConflictReport. After a week of production traffic you will know how often your specific corpus generates conflicting retrieved sets, which document pairs conflict most frequently, and what query patterns trigger conflicts. That data is more actionable than any synthetic benchmark.

4. Surface uncertainty when you cannot resolve it. The right answer to an unresolvable conflict is not to pick one and hide the choice. The warning field in RAGResponse is there precisely to support responses like: “I found conflicting information on this topic. The June 2023 policy states X; the November 2023 update states Y. The November document is more recent.”
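Assembling that kind of response is plain string formatting over the conflict data. A sketch — the tuple shape here is illustrative, not the repo's actual `ConflictReport` schema:

```python
def conflict_warning(topic: str, claims: list) -> str:
    """Build a user-facing message from (source, ISO-date, claim) tuples."""
    lines = [f"I found conflicting information on {topic}:"]
    for source, date, claim in claims:
        lines.append(f"  - {source} ({date}): {claim}")
    newest = max(claims, key=lambda c: c[1])  # ISO dates sort lexicographically
    lines.append(f"The {newest[0]} document is the most recent.")
    return "\n".join(lines)

msg = conflict_warning("the remote work policy", [
    ("HR-Policy-June-2023", "2023-06-01", "3 days per week in-office required"),
    ("HR-Policy-November-2023", "2023-11-15", "fully remote permitted"),
])
print(msg)
```

The user sees both claims, their dates, and which one is newer — and retains the ability to judge, which is exactly what the silent-resolution failure mode takes away.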


Running the Full Demo

# Full output with INFO logs
python rag_conflict_demo.py

# Demo output only (suppress model loading logs)
python rag_conflict_demo.py --quiet

# Run unit tests without downloading models
python rag_conflict_demo.py --test

# Plain terminal output for log capture / CI
python rag_conflict_demo.py --no-color

All output shown in this article is unmodified output from a local Windows machine running Python 3.9+ in a virtual environment. The code and output are fully reproducible by any reader with the listed dependencies installed.


The Takeaway

The retrieval problem is largely solved. Vector search is fast, accurate, and well-understood. The community has spent years optimising it.

The context-assembly problem is not solved. Nobody is measuring it. The gap between “correct documents retrieved” and “correct answer produced” is real, it is common, and it produces confident wrong answers with no signal that anything went wrong.

The fix does not require a larger model, a new architecture, or additional training. It requires one additional pipeline stage, running on embeddings you already have, at negligible marginal latency.

The experiment above runs in about thirty seconds on a laptop. The question is whether your production system has the equivalent layer — and if not, what it is silently answering wrong right now.


References

[1] Ye, H., Chen, S., Zhong, Z., Xiao, C., Zhang, H., Wu, Y., & Shen, F. (2026). Seeing through the conflict: Transparent knowledge conflict handling in retrieval-augmented generation. arXiv:2601.06842. https://doi.org/10.48550/arXiv.2601.06842

[2] Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv:2310.11511. https://doi.org/10.48550/arXiv.2310.11511

[3] Cattan, A., Jacovi, A., Ram, O., Herzig, J., Aharoni, R., Goldshtein, S., Ofek, E., Szpektor, I., & Caciularu, A. (2025). DRAGged into conflicts: Detecting and addressing conflicting sources in search-augmented LLMs. arXiv:2506.08500. https://doi.org/10.48550/arXiv.2506.08500

[4] Gao, L., Bi, B., Yuan, Z., Wang, L., Chen, Z., Wei, Z., Liu, S., Zhang, Q., & Su, J. (2025). Probing latent knowledge conflict for faithful retrieval-augmented generation. arXiv:2510.12460. https://doi.org/10.48550/arXiv.2510.12460

[5] Jin, Z., Cao, P., Chen, Y., Liu, K., Jiang, X., Xu, J., Li, Q., & Zhao, J. (2024). Tug-of-war between knowledge: Exploring and resolving knowledge conflicts in retrieval-augmented language models. arXiv:2402.14409. https://doi.org/10.48550/arXiv.2402.14409

[6] Joren, H., Zhang, J., Ferng, C.-S., Juan, D.-C., Taly, A., & Rashtchian, C. (2025). Sufficient context: A new lens on retrieval augmented generation systems. arXiv:2411.06037. https://doi.org/10.48550/arXiv.2411.06037

[7] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., … & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv:2005.11401. https://doi.org/10.48550/arXiv.2005.11401

[8] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv:2212.10511. https://doi.org/10.48550/arXiv.2212.10511

[9] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv:1908.10084. https://doi.org/10.48550/arXiv.1908.10084

[10] Xu, R., Qi, Z., Guo, Z., Wang, C., Wang, H., Zhang, Y., & Xu, W. (2024). Knowledge conflicts for LLMs: A survey. arXiv:2403.08319. https://doi.org/10.48550/arXiv.2403.08319

[11] Xie, J., Zhang, K., Chen, J., Lou, R., & Su, Y. (2023). Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. arXiv:2305.13300. https://doi.org/10.48550/arXiv.2305.13300

Complete Source Code: https://github.com/Emmimal/rag-conflict-demo


Models Used

Two models are used: sentence-transformers/all-MiniLM-L6-v2 (~90 MB) for embeddings and deepset/minilm-uncased-squad2 (~130 MB) for extractive QA. Both download automatically on first run and cache locally. No API key or HuggingFace authentication is required.


Disclosure

All code was written, debugged, and validated by the author through multiple iterations of real execution on a local Windows machine running Python 3.9+ in a virtual environment.

The author has no financial relationship with Hugging Face, deepset, or any organisation referenced in this article. Model and library choices were made solely on the basis of size, licence, and CPU compatibility.
