Larger Context Windows Don’t Fix RAG — So I Built a System That Does

Contents

TL;DR I Trusted the Wrong Number Why RAG Cannot Aggregate The Benchmark: Two Pipelines, Same Query Error Observability Collapse The Semantic Engine: Proof That the Correct Answer Is Fast The Fix Is Not Better Retrieval Routing the Full Benchmark The Test Suite Honest Limitations What This Changes References Disclosure

TL;DR

I built a dataset Q&A system and trusted a RAG answer that was less than half-correct.
I measured this across 7 query types and 5 context sizes on 100,000 rows.
queries away from RAG entirely.

I Trusted the Wrong Number

Last month I was heads-down building a new feature for EmiTechLogic. Learners can now upload their own messy CSV files and ask questions in plain English about their data. Sounded perfect for RAG, so I went all in — embeddings, retrieval, nice-looking responses.

The first few demos looked amazing. Clean tables, confident numbers, professional formatting. I actually started trusting the system in our internal testing.

Then I picked one number to double-check.

Real grocery spend in the dataset: $1,140,033.24.

The model gave me a beautiful breakdown by category. It looked legit. I added up the numbers it returned.

It was less than half.

I sat there staring at the screen thinking “this can’t be right.” So I did what any engineer would do. I increased the context window. 4k… 16k… 32k… 128k tokens. Each time the answer got longer, more detailed, and more confidently wrong.

That’s when it finally clicked. This wasn’t a retrieval issue. I was asking a retrieval system to perform heavy computation on data it had only partially seen. And instead of saying it was unsure or missing information, the model was producing polished, structured answers that looked correct.

Why RAG Cannot Aggregate

The RAG pipeline doesn’t truly understand structured data. All it does is take each CSV row and flatten it into plain text. That’s it. To the model, a row looks something like this:

"2019-01-01 grocery_pos 107.23 F NC Jennifer Banks ..."

For a query like “What is the total spend by category?”, the RAG pipeline does this:

1. Tokenise: ["total", "spend", "category"]
2. Score all 100,000 rows by keyword overlap
3. Return the top-N rows as serialised plain text
4. Ask the LLM to sum and group from that text

Step 4 is where the system fails. The LLM is not running a SUM. It is pattern-matching numbers from a text blob and generating a response that mimics an aggregation.

Models struggle with numerical precision at scale [1], but the real issue is the presentation. The model gives you a detailed breakdown across all categories. This is a classic trap. The output looks professional. It mimics the structure of a real report so well that your brain assumes the content is valid. You have no way to verify that 92% of your data is missing.

RAG is a retrieval tool. It is not a calculation engine. Retrieval finds relevant fragments. Computation requires a full dataset scan. When you use RAG for math, you get a wrong answer that looks authoritative. That distinction is critical. A partial answer signals that data is missing. A complete-looking wrong answer just signals false confidence.

Complete code: https://github.com/Emmimal/context-window-engine/

The Benchmark: Two Pipelines, Same Query

To measure this precisely, I built a benchmark that runs two pipelines side by side for every query.

The first pipeline is a RAG simulation. It models what a naive vector pipeline passes to an LLM at five context sizes. I tested five context sizes, ranging from 5 rows up to 8,000. That scales from 325 tokens to 500,000. For each size, I tracked three metrics: how much data the LLM sees, what sum it computes from that specific slice, and whether a reader could actually spot the error.

The second pipeline is a semantic engine that executes the same query as a deterministic full-scan over all 100,000 rows and returns the exact correct answer.

Architectural comparison of query processing workflows, contrasting text-based RAG Simulation retrieval with structured data aggregation in a Semantic Engine. Image by Author.

The simulation does not reproduce exact LLM outputs. What it preserves is the key structural property: a partial slice of data fed into a system that returns a full-form answer. That is the property that causes the problem, and that is what the benchmark measures.

I chose seven query types to cover every aggregation pattern a structured data system is likely to encounter:

Query	Operation	Why it breaks RAG
Total spend by category	SUM + GROUP BY	Requires summing all rows across 14 groups
Highest average transaction by category	AVG + GROUP BY	Average changes with every missing row
Total spent on grocery_pos	SUM + categorical filter	Filter requires seeing all matching rows
How many female customers transacted	COUNT + filter	Count is meaningless on a partial scan
Total spend where amount > $500	SUM + numeric comparison	Threshold logic requires full data
State with lowest total spending	MIN + GROUP BY across 50 groups	Minimum can only be found with all groups present
Percentage of transactions that are fraudulent	COUNT + ratio	Ratio is undefined on a partial denominator

These queries are not unique or complex. They are the standard questions any analyst asks when looking at a new dataset. That is exactly why this failure is so critical.

Error Observability Collapse

Here is the full benchmark output for the query that started all of this. I am showing it in full because the numbers make the problem impossible to dismiss.

GROUND TRUTH (Semantic Engine)
SUM(amt) GROUP BY category → 14 groups
  #1  grocery_pos               1,140,033.24
  #2  shopping_net                773,527.93
  #3  shopping_pos                725,766.14
  #4  gas_transport               648,804.24
  #5  home                        556,526.53
Latency: 100.47ms | Rows scanned: 100,000

RAG SIMULATION — what the LLM receives at each context size

Context               Rows   Coverage    Partial sum  Error detectable?
tiny   (~325 tokens)     5   0.0050%         197.73  EASY
small  (~3K tokens)     50   0.0500%       2,003.56  MODERATE
medium (~32K tokens)   500   0.5000%      31,023.21  HARD
large  (~130K tokens) 2,000  2.0000%     140,093.16  VERY HARD
xlarge (~520K tokens) 8,000  8.0000%     569,368.22  NEAR IMPOSSIBLE

I stared at these results for a while. The most troubling part wasn’t just that the answers were wrong, it was how much harder the errors became to spot as the context window grew.

At 8,000 rows the error was still over 50%, yet the response looked like a professional report. You’d need to manually verify the numbers to notice something was off. That’s what I started calling Error Observability Collapse. The more context I gave the model, the more convincing — but not more accurate — the output became.

The “Partial sum” column shows the total if the LLM added every amount value in the rows it actually retrieved. The “Error detectable?” column scores how likely a human reader is to spot a mistake.

With 5 rows, the partial sum is 197.73. The correct total is 1,140,033.24. It is obvious. The output is short, the numbers are wrong, and the missing data is clear. The error is instant.

At 8,000 rows, the partial sum hits 569,368.22. The LLM has now seen all 14 categories. It generates a 1,500-word report with specific figures and confident language. The error is 50%, but it is hidden inside authoritative, well-structured prose. Without an external reference, a reader has no way to catch it.

This is the pattern that held across all seven queries:

Context Window	Rows	Dataset Coverage	Response Length	Error Detectable?
~325 tokens	5	0.005%	~50 words	YES — obviously a guess
~3K tokens	50	0.050%	~150 words	MAYBE
~32K tokens	500	0.500%	~400 words	HARD
~130K tokens	2,000	2.000%	~800 words	VERY HARD
~520K tokens	8,000	8.000%	~1,500 words	NEAR IMPOSSIBLE
Semantic Engine	100,000	100%	<200ms	N/A — exact

I called this Error Observability Collapse. As context grows, confidence scales with it. Correctness does not.

Flowchart and trend lines demonstrating the effects of increasing context size in LLMs. The graphic shows that more context leads to higher confidence and lower error detectability, while overall accuracy remains flat. — The illusion of context: How larger context windows in RAG and LLM systems increase user confidence and decrease error detectability without improving actual accuracy. Image by Author.

The failure modes are asymmetric, which makes them dangerous:

A wrong RAG answer looks correct. It is formatted, specific, and confident. A failed computation throws an explicit error. It is visible.

One failure is silent. The other is loud. As context windows reach millions of tokens, the silent failure becomes harder to detect [4]. The system does not get safer as it scales. It just gets more convincing.

The Semantic Engine: Proof That the Correct Answer Is Fast

Before I fully understood the problem, I had already thrown together a simple semantic engine out of frustration. I just wanted the correct answer at least once.

The approach turned out to be simple: parse the query into proper operations and run a single pass over the entire dataset. No embeddings, no retrieval, no guessing.

Here’s what that looks like in practice:

The logic is simple. Take a query like “What is the total spend by category?”. The engine maps this to a direct operation: SUM(amt) GROUP BY category. It processes the full 100,000-row set in a single pass. It accumulates grouped totals. There is no retrieval. No inference. No partial scanning. It visits every row once and returns the exact result.

This proves that the correct answer is not expensive. Benchmark queries finished under 200ms. Sample size: 100,000 rows. Aggregation is trivial. The failure happens when you route those queries to a system built to misunderstand them.

from context_window_engine import compute_ground_truth, load_csv

rows = load_csv("data/credit_card_transactions.csv", max_rows=100_000)

gt = compute_ground_truth(
    query_label = "total by category",
    rows        = rows,
    agg_func    = "sum",
    agg_col     = "amt",
    group_col   = "category",
)
# gt.answer     → [(grocery_pos, 1140033.24), (shopping_net, 773527.93), ...]
# gt.latency_ms → 100.47

Engine supports SUM, AVG, COUNT, MIN, MAX. Handles categorical and numeric filters. Includes GROUP BY and ratios. Zero external dependencies. Every operation runs as a deterministic function over the full list.

The engine itself isn’t the product. It’s the proof: the correct answer is reachable under a second. No inference required. The real challenge is routing queries there reliably.

The Fix Is Not Better Retrieval

Stop trying to improve retrieval. If a query needs 100% of the data, an 8% sample fails. The fix is removing retrieval from the loop.

We need a classification layer. It sits before the pipeline and makes one binary call: computation or lookup?

The difference is clear. “Total spend by category” demands a full scan. “Find transactions from Jennifer Banks” is a simple lookup. Standard RAG forces both down the same path. That is the design flaw.

A QueryRouter fixes this. It inspects every incoming query and routes it to the correct path before a single retrieval begins.

Architectural flowchart illustrating a QueryRouter classifying incoming queries. The router splits workloads into a blue-coded Computation path for analytic queries handled by a SemanticEngine, and a green-coded Retrieval path for search queries handled by a RAG pipeline. — Intent-based query routing architecture, separating analytical calculation intents from semantic information retrieval pipelines. Image by Auhor.

The classifier uses three signal tiers, prioritized. Tier 1: aggregation verbs—total, how many, average, lowest, percentage. These demand full-dataset computation. Tier 2: numeric comparison—greater than 500, above $1,000, at least. These imply filter-then-aggregate, impossible for RAG. Tier 3: retrieval signals—find, show me, list, fetch. These indicate lookups where semantic similarity works.

Tier	Signal	Examples	Route
1	Aggregation verb	`total`, `how many`, `average`, `lowest`, `percentage`	COMPUTATION
2	Numeric comparison	`greater than 500`, `above $1,000`, `at least`	COMPUTATION
3	Retrieval signal	`find`, `show me`, `list`, `fetch`	RETRIEVAL
0	No match	ambiguous	COMPUTATION — safer default

Default to COMPUTATION if no tier matches. This is deliberate. Failure modes are asymmetric: a wrong RAG answer on an aggregation is silently wrong. A computation engine that can’t parse a query throws an error. When in doubt, fail loudly.

from query_router import QueryRouter

router = QueryRouter(rows)

result = router.route("What is the total spend by category?")
# result.routed_to     → "COMPUTATION"
# result.answer.answer → [(grocery_pos, 1140033.24), ...]
# result.total_latency → ~250ms — classify + execute combined

result = router.route("Find transactions from Jennifer Banks")
# result.routed_to     → "RETRIEVAL"
# result.answer.safe   → True — RAG is appropriate

Routing the Full Benchmark

I ran nine queries through the router to verify performance across both types: seven aggregation queries destined for the semantic engine, and two lookup queries for RAG.

Every route was correct. The seven aggregation queries hit the full-scan engine and returned exact results. The two lookup queries correctly triggered the RAG path. Look at the output: high confidence scores, correct pattern matching, and latency under 130ms—even with the 100,000-row scan.

[1] ✓  COMPUTATION   "What is the total spend by category?"
     Tier 1 | matched='total' | confidence=0.97
     #1 grocery_pos      1,140,033.24  (102.57ms | 100,000 rows | exact)

[2] ✓  COMPUTATION   "Which category has the highest average transaction amount?"
     Tier 1 | matched='highest' | confidence=0.97
                               71.91  (119.47ms | 100,000 rows | exact)

[3] ✓  COMPUTATION   "What is the total amount spent on grocery_pos?"
     Tier 1 | matched='total' | confidence=0.97
                        1,140,033.24  (49.96ms  | 100,000 rows | exact)

[4] ✓  COMPUTATION   "How many transactions were made by female customers?"
     Tier 1 | matched='How many' | confidence=0.97
                           54,641.00  (90.45ms  | 100,000 rows | exact)

[5] ✓  COMPUTATION   "What is the total spend where amount is greater than 500?"
     Tier 1 | matched='total' | confidence=0.97
                        1,274,269.60  (91.65ms  | 100,000 rows | exact)

[6] ✓  COMPUTATION   "Which state has the lowest total spending?"
     Tier 1 | matched='lowest' | confidence=0.97
     lowest RI               2,125.60  (109.05ms | 100,000 rows | exact)

[7] ✓  COMPUTATION   "What percentage of transactions are fraudulent?"
     Tier 1 | matched='percentage' | confidence=0.97
                              0.9900%  (87.35ms  | 100,000 rows | exact)

[8] ✓  RETRIEVAL     "Find transactions from Jennifer Banks"
     Tier 3 | matched='Find' | confidence=0.85
     RAG is appropriate — no aggregation required

[9] ✓  RETRIEVAL     "Show me a sample transaction from Texas"
     Tier 3 | matched='Show me' | confidence=0.85
     RAG is appropriate — no aggregation required

Routing accuracy: 9/9

9/9 correct. Error Observability Collapse is impossible if aggregation queries never reach RAG.

The Test Suite

The benchmark verifies nine specific queries. The test suite ensures reliability across a broader range: edge cases, malformed inputs, missing data, and common production failure points.

The engine suite has 87 tests across 10 classes. It covers float parsing with dollar signs, commas, and scientific notation; all five aggregation functions under normal conditions and with empty inputs; all five numeric filter operators; full GROUP BY aggregation with categorical and numeric filters combined; RAG simulation coverage metrics at each context size; and edge cases including empty datasets, rows with missing column values, and single-row inputs.

The router suite has 72 tests across 5 classes. It covers all three tier patterns, including edge cases like all-caps queries and very long queries; natural language to typed operation parsing for every supported query form; routing and execution correctness against all seven benchmark queries; and a contrast suite that verifies router answers match independent ground-truth computation — ensuring the router does not introduce any deviation from the engine’s own output.

Run the engine tests by typing python space -m space unittest space test_engine space -v. This executes the 87 tests in the suite.

Run the router tests by typing python space -m space unittest space test_router space -v. This executes the 72 tests in the suite.

All 159 pass on Python 3.9+ with zero external dependencies.

Honest Limitations

This solution isn’t perfect. It only works on single CSV files right now. Real production datasets are usually messy with multiple tables that need joining — I deliberately kept the scope small because I wanted something that actually worked end-to-end first.

The router is also still pretty basic (regex-based). I tried a small LLM-based classifier early on but it was inconsistent and added latency, so I went back to the simple approach. Sometimes the boring solution wins.

I also simulated the RAG responses instead of hitting real APIs for the benchmark. The patterns hold up, but your mileage with GPT-4o or Claude 3.5 will vary slightly.

CSV format required. The engine loads structured data directly from CSV files. Database connections, Parquet files, and other tabular formats are not supported at this time.

What This Changes

Adding a routing layer costs almost nothing. Classifying a query against 65 regex patterns takes just microseconds. The semantic engine adds less than 200ms to scan a 100,000 row dataset. The total overhead is smaller than a single embedding call.

What you get in return is a deterministic answer for every aggregation query. Every total, every count, and every percentage now comes from a full scan instead of a confident approximation based on 8 percent of the data. RAG keeps handling what it is actually good at: retrieving specific records, surfacing relevant passages, and answering lookup questions where semantic similarity is the right tool for the job.

RAG is not broken. It is just being asked to compute, and it cannot do that.
The dangerous part is not that it fails. It is that it fails convincingly. And no amount of context changes that.

You can try typing it out like this:

To start, clone the repository using git clone followed by the URL https://github.com/Emmimal/context-window-engine/. Once that finishes, move into the directory by typing cd context-window-engine. Finally, launch the project by running python demo.py in your terminal.

References

[1] Levy, M., Jacoby, A., & Goldberg, Y. (2024). Same task, more tokens: The impact of input length on the reasoning performance of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15339–15353, Bangkok, Thailand. Association for Computational Linguistics.
https://doi.org/10.18653/v1/2024.acl-long.818

[2] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://doi.org/10.48550/arXiv.2005.11401

[3] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J.,
Guo, Q., Wang, M., & Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
https://doi.org/10.48550/arXiv.2312.10997

[4] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://doi.org/10.1162/tacl_a_00638

[5] Koshorek, O., Granot, N., Alloni, A., Admati, S., Hendel, R., Weiss, I., Arazi, A., Cohen, S.-N., & Belinkov, Y. (2025). Structured RAG for answering aggregative questions. arXiv preprint arXiv:2511.08505.
https://doi.org/10.48550/arXiv.2511.08505

Disclosure

All benchmark numbers are from actual runs on Python 3.12.6, Windows 11, CPU only, no GPU. The benchmark uses the Credit Card Transactions Fraud Detection dataset (Kartik Gajjar, Kaggle, 2020), a synthetic dataset generated using the Sparkov transaction simulator created by Brandon Harris, licensed CC0 (Public Domain), available at kaggle.com/datasets/kartik2112/fraud-detection. The RAG baseline simulates retrieval and models confidence signals — no real LLM API calls are made. No external API keys are required to reproduce any result in this article. All code described here was written and tested by me.