Retrieval-Augmented Generation (RAG) has moved out of the experimental phase and firmly into enterprise production. We are no longer just building chatbots to test LLM capabilities; we are constructing complex, agentic systems that interface directly with internal structured databases (SQL), unstructured knowledge lakes (Vector DBs), and third-party APIs and MCP tools. However, as RAG adoption scales within an organization, a glaring and expensive problem becomes evident: redundancy.
In many enterprise RAG deployments, teams observe that over 30% of user queries are repetitive or semantically similar. Employees across different departments ask for the same Q4 sales numbers, the same onboarding procedures, and the same summaries of standard vendor contracts. External users asking about health insurance premiums for their age often receive responses that are identical across similar profiles.
In a naive RAG architecture, every single one of these repeated questions triggers an identical, expensive chain of events: generating embeddings, executing vector similarity searches, scanning SQL tables, retrieving massive context windows, and forcing a Large Language Model (LLM) to reason over the exact same tokens to produce an answer it generated an hour ago.
This redundancy inflates cloud infrastructure costs and adds unnecessary multi-second latencies to user responses. We need an intelligent caching strategy to control costs and keep RAG viable as the user and query volume increases.
However, caching for Agentic RAG is not a simple `key: value` store. Language is nuanced, data is highly dynamic, and serving a stale or hallucinated answer from the cache is a real risk. In this article, I will demonstrate a caching architecture, grounded in real-world scenarios, that can bring tangible benefits.
The Setup: A Dual-Source Agentic System
Let us consider a simulated enterprise environment using a dataset of Amazon Product Reviews (CC0).
Our Agentic RAG system acts as an intelligent router equipped with access to two data stores:
1. A Structured SQL Database (SQLite): Contains tabular review data (Id, ProfileName, Score, Time, Summary, Review Text).
2. An Unstructured Vector Database (FAISS): Contains the embedded text of customer product reviews. This simulates internal knowledge bases, wikis, and policy documents.
The Two-Tier Cache Architecture
We utilize a Two-Tier Cache architecture because users rarely ask exactly the same question verbatim, but they frequently ask questions with the same meaning that therefore require the same underlying context.
Tier 1: The Semantic Cache (Query Level)
The Semantic Cache acts as the first line of defense, intercepting the user query. Unlike a traditional cache that requires a perfect string match (e.g., caching `SELECT * FROM table`), a Semantic Cache uses embeddings.
When a user asks a question, we embed the query and compare it against previously cached queries using cosine similarity. If the new query is semantically identical (say, a similarity score above 95%), we immediately return the previously generated LLM answer. For instance:
Query A: “What is the company leave policy?”
Query B: “Can you tell me the policy for taking time off?”
The Semantic Cache recognizes these as identical intents. It intercepts the request before the Agent is even invoked, resulting in an answer that is delivered in milliseconds with zero LLM token costs.
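As a minimal sketch, Tier 1 reduces to an embedding-comparison loop. Everything here is illustrative: the `embed_fn` callback, the in-memory list, and the 0.95 threshold stand in for a production embedding model and vector index.

```python
import numpy as np

class SemanticCache:
    """Tier 1 sketch: return a previously generated answer when a new query's
    embedding is nearly identical to a cached one. The `embed_fn` callback and
    the 0.95 threshold are assumptions, not a fixed API."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_answer) pairs

    def lookup(self, query):
        q = self.embed_fn(query)
        for emb, answer in self.entries:
            sim = float(np.dot(q, emb) /
                        (np.linalg.norm(q) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return answer  # cache HIT: the agent is never invoked
        return None  # cache MISS: route to the agent

    def store(self, query, answer):
        self.entries.append((self.embed_fn(query), answer))
```

In production, the linear scan over `entries` would be replaced by an approximate nearest-neighbor index (e.g., FAISS) so lookups stay fast as the cache grows.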
Tier 2: The Retrieval Cache (Context Level)
Let’s consider the user asks the query in the following way:
Query C: “Summarize the leave policy specifically for remote workers.”
This is not a 95% match, so it misses Tier 1. However, the underlying documents needed to answer Query C are exactly the same documents retrieved for Query A. This is where Tier 2, the Retrieval Cache, activates.
The Retrieval Cache stores the raw data blocks (SQL rows or FAISS text chunks) against a broader “Topic Match” threshold (e.g., > 70%). When the Semantic Cache misses, the agent checks Tier 2. If it finds relevant pre-fetched context, it skips the expensive database lookups and directly feeds the cached context into the LLM to generate a fresh answer. It acts as a high-speed notepad.
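Tier 2 follows the same shape but stores raw context chunks instead of final answers, under a looser threshold. Again a sketch: the 0.70 threshold and `embed_fn` are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class RetrievalCache:
    """Tier 2 sketch: store raw retrieved context (SQL rows / vector chunks)
    keyed by a topic embedding, matched at a broader 'topic' threshold."""

    def __init__(self, embed_fn, threshold=0.70):
        self.embed_fn = embed_fn
        self.threshold = threshold
        self.entries = []  # list of (topic_embedding, context_chunks)

    def lookup(self, topic):
        t = self.embed_fn(topic)
        for emb, chunks in self.entries:
            if cosine(t, emb) >= self.threshold:
                return chunks  # reuse context; the LLM still writes a fresh answer
        return None

    def store(self, topic, chunks):
        self.entries.append((self.embed_fn(topic), chunks))
```

The key design difference from Tier 1: a hit here skips the database lookup but not the LLM call, which is exactly what a reworded question on the same topic needs.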
The Intelligent Router: Agent Construction & Tooling
Fetching from the caches is not enough. We also need mechanisms to detect staleness of the saved cache content, to prevent incorrect responses to the user. To orchestrate retrieval and validation across the two-tier cache and the dual-source backends, the system relies on an LLM Agent. Rather than acting only as a response synthesizer over retrieved context, the agent here is given a rigorous system prompt and a specific set of tools that allow it to act as an intelligent query router and data validator.
The agent toolkit consists of several custom functions it can autonomously invoke based on the user’s intent:
- search_vector_database: Queries the Vector DB (FAISS) for unstructured text.
- query_sql_database: Executes dynamic SQL queries against the local SQLite database to fetch exact numbers or filtered data.
- check_retrieval_cache: Pulls pre-fetched context for >70% similar topics to skip Vector/SQL lookups.
- check_source_last_updated: Quickly queries the live SQL database for the exact MAX(Time) timestamp. Helps detect whether the source 'reviews' table has been updated, for global aggregation queries (e.g., "What is the average score across all reviews?").
- check_row_timestamp: Validates the Date-Time value of a specific row ID.
- check_data_fingerprint: Calculates a hash of a document's content to detect changes. Useful when there is no Date-Time column, or for a distributed database.
- check_predicate_staleness: Checks whether a specific "slice" of data (e.g., a specific year) has changed.
This tool-calling architecture transforms the LLM from a passive text generator into an active, self-correcting data manager. The following scenarios will depict how these tools are used for specific types of queries to manage cost and accuracy of responses. The figure depicts the query flow across all the scenarios covered here.
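To make the wiring concrete, here is one plausible sketch of how such a toolkit could be exposed to the agent: a name-to-function registry that the runtime consults when the LLM emits a tool call. The registry keys mirror the tool names above; the dispatch shape and the stub bodies are my assumptions, not a specific framework's API.

```python
# Minimal tool-dispatch sketch: the LLM emits {'name': ..., 'args': {...}},
# and the runtime looks the function up and executes it. Bodies are stubs.

def search_vector_database(query: str) -> list:
    raise NotImplementedError  # would query the FAISS index

def query_sql_database(sql: str) -> list:
    raise NotImplementedError  # would execute against SQLite

TOOLS = {
    "search_vector_database": search_vector_database,
    "query_sql_database": query_sql_database,
    # ... plus check_retrieval_cache, check_row_timestamp,
    #     check_data_fingerprint, check_predicate_staleness, etc.
}

def dispatch(tool_call: dict):
    """Execute one tool call emitted by the LLM."""
    fn = TOOLS.get(tool_call["name"])
    if fn is None:
        raise ValueError(f"Unknown tool: {tool_call['name']}")
    return fn(**tool_call["args"])
```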
Real-World Scenarios
Scenario 1: The Semantic Cache Hit (Speed & Cost)

This is the ideal scenario, where a question from one user is almost identically repeated by another user (>95% similarity). For example, a user asks the system: "What are the common opinions about coffee taste?". Since it is the first time the system has seen this question, it results in a cache MISS. The agent methodically queries the Vector Search, retrieves three documents, and the LLM spends 36 seconds reasoning over the text to generate a comprehensive summary of bitter versus delicious coffee profiles.
A moment later, a second user asks the same question. The system generates an embedding, looks at the Semantic Cache, and registers a hit. The exact answer is returned instantly.
The net impact is a response time drop from ~36.0 seconds to 0.02 seconds. Total token cost for the second query: $0.00.
Here is the query flow.
============================================================
==== Scenario 1: The Semantic Cache Hit (Speed & Cost) =====
============================================================
-> Asking it the FIRST time (expect Cache MISS, slow LLM + DB lookups)
[USER]: What are the common opinions about coffee taste?
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[TOOL: RetrievalCache]: Checking cache for topic: 'common opinions about coffee taste'
[TOOL: RetrievalCache]: MISS. Topic not found in cache.
[TOOL: VectorSearch]: Searching for 'common opinions about coffee taste'...
[TOOL: VectorSearch]: Found 3 documents. Saving to Retrieval Cache.
[AGENT]: Based on the reviews, common opinions about coffee taste vary. Some find it to have a bitter taste, while others describe it as great tasting and delicious. There are also opinions that coffee can be stale and lacking in flavor. Some consumers are also concerned about achieving the full flavor potential of their coffee.
[TIME TAKEN]: 36.13 seconds
-> Asking it the SECOND time (expect Semantic Cache HIT, instant)
[USER]: What are the common opinions about coffee taste?
[SYSTEM]: Semantic Cache HIT -> Based on the reviews, common opinions about coffee taste vary. Some find it to have a bitter taste, while others describe it as great tasting and delicious. There are also opinions that coffee can be stale and lacking in flavor. Some consumers are also concerned about achieving the full flavor potential of their coffee.
[TIME TAKEN]: 0.02 seconds
Scenario 2: Retrieval Cache (Shared Context)

Next, the user asks a follow-up: “Summarize these opinions into 3 bullet points.”
The Semantic Cache registers a MISS because the intent (summarization format) is fundamentally different. However, the semantic topic is highly similar (>70%). The system hits the Tier 2 Retrieval Cache, pulls the exact same 3 documents fetched in Scenario 1, and passes them to the LLM to format into bullets.
The net impact is that we eliminate the latency and cost of the vector database's nearest-neighbor search, keeping data retrieval strictly in-memory.
Here is the query flow.
============================================================
===== Scenario 2: Retrieval Cache Hit (Shared Context) =====
============================================================
-> Ensuring Retrieval Cache is seeded (silent check)...
[USER]: What are the common opinions about coffee taste?
[SYSTEM]: Semantic Cache HIT -> Based on the reviews, common opinions about coffee taste vary. Some find it to have a bitter taste, while others describe it as great tasting and delicious. There are also opinions that coffee can be stale and lacking in flavor. Some consumers are also concerned about achieving the full flavor potential of their coffee.
-> Asking a DIFFERENT question on the SAME TOPIC.
-> Semantic query is slightly different so Semantic cache misses.
-> Agent should hit Retrieval Cache to avoid FAISS lookup and answer it.
[USER]: Summarize these coffee taste opinions in a bulleted list.
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[TOOL: RetrievalCache]: Checking cache for topic: 'coffee taste opinions'
[TOOL: RetrievalCache]: HIT! Found cached context (Document ID: 481389)
[AGENT]: Here's a summary of the coffee taste opinions:
* One user found the coffee to have a "weird whang" and a bitter taste, expressing disappointment.
* Another user enjoyed the coffee, describing it as "great tasting" and "delicious" when made in a drip coffee maker, though they were unsure if they were achieving its full flavor potential due to a lack of brewing instructions.
* A third user was greatly disappointed, finding the coffee stale and lacking in flavor.
[TIME TAKEN]: 34.24 seconds
Scenario 3: Agentic Cache Bypass

If the user query concerns the latest data, such as current trends or recent sales figures, it is advisable to bypass the cache entirely. In this scenario, the user queries: "What are the LATEST negative reviews?"
In this case, the Agentic router inspects the user query and understands the temporal intent. Based on the system prompt, it then explicitly decides to bypass the cache entirely. The query is routed straight to the source SQL database to ensure up-to-date context for building the response.
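In the article's design, the LLM itself detects temporal intent via its system prompt. As a belt-and-suspenders sketch, a deterministic keyword pre-filter can enforce the same rule even if the model misses the cue. The word list below is purely illustrative.

```python
import re

# Hypothetical guard-rail: queries matching these temporal markers are routed
# straight to the live source, never served from cache. The pattern is an
# illustrative assumption, not an exhaustive list.
TEMPORAL_PATTERN = re.compile(
    r"\b(latest|newest|most recent|current|today|right now|this (week|month|year))\b",
    re.IGNORECASE,
)

def should_bypass_cache(query: str) -> bool:
    """True if the query asks about fresh data and must skip both cache tiers."""
    return bool(TEMPORAL_PATTERN.search(query))
```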
Here is the query flow.
============================================================
======= Scenario 3: Agentic Bypass for 'Latest' Data =======
============================================================
-> Asking for 'latest' data.
-> Agent prompt logic should explicitly bypass cache and go to SQL.
[USER]: What are the latest 5 star reviews?
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[AGENT]: Here are the latest 5-star reviews:
* **Score:** 5, **Summary:** YUM, **Text:** Skinny sticks go a little too fast in my household!.. continued
Scenario 4: Row-Level Staleness Detection

Data is not static, so cache contents must be validated before use.
Let’s say a user asks: “What is the summary of the review with ID 120698?” The system caches the answer.
Subsequently, an administrator updates the database, changing the summary text for the same ID. When the user asks the exact same question again, the Semantic Cache identifies a 100% match. However, it does not blindly serve the answer.
Every cache entry is stored with a Validation Strategy Tag. Before returning the hit, the system triggers the check_row_timestamp agent tool. It quickly checks the Time column for ID 120698 in the live database. Seeing that the live database timestamp is newer than the cache’s creation timestamp, the system triggers an Invalidation. It drops the stale cache, forces an agentic query to the database, and retrieves the corrected summary.
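A row-level check like this is a single indexed lookup. Here is a minimal sketch against SQLite, following the article's `reviews` schema (`Id`, `Time`); the function name mirrors the agent tool, but the signature is my assumption.

```python
import sqlite3

def check_row_timestamp(conn, row_id: int, cached_at: float) -> bool:
    """Return True if the cached answer for this row is still fresh, i.e. the
    row's Time column is not newer than the cache entry's creation time."""
    row = conn.execute(
        "SELECT Time FROM reviews WHERE Id = ?", (row_id,)
    ).fetchone()
    if row is None:
        return False  # row deleted since caching: treat as stale
    return row[0] <= cached_at
```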
Here is the query flow. I have added an additional check to show that updating an unrelated row does not invalidate the cache.
============================================================
== Scenario 4: Staleness Detection (Row-Level Timestamp) ===
============================================================
-> Step 1: Initial Ask (Expect MISS, Agent fetches from SQL)
[USER]: Provide a detailed summary of review ID 120698.
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[TOOL: RetrievalCache]: Checking cache for topic: 'review ID 120698'
[TOOL: RetrievalCache]: MISS. Topic not found in cache.
[AGENT]: The review for ID 120698 is summarized as "Burnt tasting garbage"..contd.
-> Step 2: Asking again (Expect HIT - Data is Fresh)
[USER]: Provide a detailed summary of review ID 120698.
[SYSTEM]: Semantic Cache HIT (Fresh Row Timestamp) -> The review for ID 120698 is summarized as "Burnt tasting garbage"..contd..
-> Step 3: Simulating Background Update (Unrelated ID 99999)...
-> Testing retrieval AFTER unrelated change (Expect HIT - Row is still fresh):
[USER]: Provide a detailed summary of review ID 120698.
[SYSTEM]: Semantic Cache HIT (Fresh Row Timestamp) -> The review for ID 120698 is summarized as "Burnt tasting garbage"..contd..
-> Now updating the target review (Row 120698) itself...
[REAL-TIME UPDATE]: New Timestamp in DB: 27-02-2026 03:53:00
-> Testing Semantic Cache retrieval for Row 120698 AFTER its own update:
-> EXPECTATION: Stale cache detected (Row-Level). Invalidating.
[USER]: Provide a detailed summary of review ID 120698.
[SYSTEM]: Stale cache detected (Row 120698 updated at 27-02-2026 03:53:00). Invalidating.
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[TOOL: RetrievalCache]: Checking cache for topic: 'review ID 120698'
[TOOL: RetrievalCache]: MISS. Topic not found in cache.
[AGENT]: The UPDATED review for ID 120698 is summarized as "Burnt tasting garbage"..contd..
Scenario 5: Table-Level Staleness (Aggregations)

Row-level validation works well for single lookups, but not for queries that aggregate over a large number of rows. For example, a user asks: "How many total reviews are in the database?" or "What is the average score for all reviews?", and then another user asks the same question again. In this case, checking the timestamp of thousands of rows would be highly inefficient. Instead, the Semantic Cache tags aggregation queries with a Table MAX Time validation strategy. When the same question is asked again, the agent uses the check_source_last_updated tool to run SELECT MAX(Time) FROM reviews. If it sees a newer source-table timestamp, it invalidates the cache and recalculates the total count accurately.
Here is the query flow.
============================================================
====== Scenario 5: Staleness Detection (Table-Level) =======
============================================================
-> Step 1: Initial Ask (Expect MISS, Agent performs global count)
[USER]: How many total reviews are in the database?
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[TOOL: RetrievalCache]: Checking cache for topic: 'total number of reviews'
[TOOL: RetrievalCache]: MISS. Topic not found in cache.
[AGENT]: There are 205 total reviews in the database.
-> Step 2: Asking again (Expect HIT - Table is Fresh)
[USER]: How many total reviews are in the database?
[SYSTEM]: Semantic Cache HIT (Fresh Source Timestamp) -> There are 205 total reviews in the database.
-> Adding a brand new review record (id 11111) with a FRESH timestamp...
-> Testing Global Cache retrieval AFTER table change:
-> EXPECTATION: Stale cache detected (Source-Level). Invalidating.
[USER]: How many total reviews are in the database?
[SYSTEM]: Stale cache detected (Source 'reviews' updated at 27-02-2026 08:03:26). Invalidating.
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[TOOL: RetrievalCache]: Checking cache for topic: 'total number of reviews'
[TOOL: RetrievalCache]: MISS. Topic not found in cache.
[AGENT]: There are 206 total reviews in the database.
Scenario 6: Staleness Detection via Data Fingerprinting

Sometimes, databases don’t have reliable updated_at timestamps, or we are dealing with unstructured text files or a distributed database. In this scenario, we rely on cryptographic hashing. A user queries: “What does review ID 120698 say?” The system caches the response alongside a SHA-256 Hash of the underlying source text.
When the text is altered without updating a timestamp, the Semantic Cache still registers a hit. Using the check_data_fingerprint tool, it attempts validation by comparing the cached SHA-256 hash against a fresh hash of the live source text. The hash mismatch raises a red flag, safely invalidating the cache despite the silent edit.
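The fingerprint check itself is a few lines of standard-library code. A sketch, with function names of my choosing:

```python
import hashlib

def fingerprint(text: str) -> str:
    """SHA-256 hex digest of the source text, stored alongside the cache entry."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_cache_valid(cached_hash: str, live_text: str) -> bool:
    """A silent edit leaves timestamps untouched but still changes the hash."""
    return fingerprint(live_text) == cached_hash
```

Because the digest depends only on content, this works identically for SQL rows, flat files, or replicas in a distributed store, at the cost of re-reading the source text on every validation.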
Here is the query flow.
============================================================
== Scenario 6: Staleness Detection (Data Fingerprinting) ===
============================================================
-> Step 1: Initial Ask (Expect MISS, Agent fetches text)
[USER]: What is the exact text of review ID 120698?
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[AGENT]: The exact text of review ID 120698 is: 'The worst coffee beverage I\'ve..contd.'
-> Step 2: Asking again (Expect HIT - Hash is Valid)
[USER]: What is the exact text of review ID 120698?
[SYSTEM]: Semantic Cache HIT (Valid Hash) -> The exact text of review ID 120698 is: 'The worst coffee beverage I\'ve ..contd.
-> Modifying the underlying source text without timestamp in SQL DB...
-> Testing Semantic Cache retrieval AFTER content change:
-> EXPECTATION: Stale cache detected (Hash mismatch). Invalidating.
[USER]: What is the exact text of review ID 120698?
[SYSTEM]: Stale cache detected (Hash mismatch). Invalidating cache and re-running.
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[AGENT]: The exact text of review ID 120698 is: 'The worst coffee beverage I\'ve ..contd.
Scenario 7: Retrieval Cache Fallback (Context Sufficiency)

While the Tier 2 context cache is a powerful tool, sometimes the cached context contains only part of the answer to the user's question.
For example, a user asks: “What is the sentiment about packaging of the coffee?” The system searches, and the Vector database returns documents exclusively talking about the packaging of the coffee. This is cached.
Next, the user asks: “What do people think about the packaging and the taste of the coffee?”
The system hits the Retrieval Cache based on topic similarity and passes the documents to the LLM. But the agent is instructed to evaluate the sufficiency of the context returned by the check_retrieval_cache tool. The agent analyzes the cached context and realizes that it only covers packaging, not the taste of the coffee.
Instead of hallucinating an answer about taste, the agent triggers a Context Fallback. It discards the cache, generates new queries specifically targeting “coffee taste” and “coffee packaging”, queries the live Vector DB, and merges the results to provide a complete, fact-based answer.
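In the article this sufficiency judgment is made by the LLM itself. As a crude stand-in that illustrates only the control flow, imagine a keyword-coverage check: every aspect of the question must appear somewhere in the cached context, or a fresh search is triggered for the gaps. Function names and the substring heuristic are my assumptions.

```python
def missing_aspects(aspects: list, cached_chunks: list) -> list:
    """Return the question aspects not mentioned anywhere in the cached context.
    (Naive substring heuristic; a real system would ask the LLM to judge.)"""
    blob = " ".join(cached_chunks).lower()
    return [a for a in aspects if a.lower() not in blob]

def answer_with_fallback(aspects, cached_chunks, vector_search):
    """Serve cached context if sufficient, else fetch fresh docs for the gaps
    and merge (the Context Fallback path)."""
    gaps = missing_aspects(aspects, cached_chunks)
    if not gaps:
        return cached_chunks  # cache is sufficient as-is
    fresh = []
    for aspect in gaps:
        fresh.extend(vector_search(aspect))  # live Vector DB lookup per gap
    return cached_chunks + fresh
```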
Here is the query flow.
============================================================
Scenario 7: Retrieval Cache Fallback (Context Sufficiency)
============================================================
-> Step 1: Seeding Retrieval Cache with NARROW context (Packaging only) for a BROAD topic...
-> Step 2: Asking a BROAD question ('packaging' AND 'taste').
-> EXPECTATION:
[USER]: What do people think about the packaging and the actual taste of the coffee?
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[TOOL: RetrievalCache]: Checking cache for topic: 'packaging and taste of coffee'
[TOOL: RetrievalCache]: HIT! Found cached context (Review 1: The box arrived slightly dented but the internal wrap was secure.
[TOOL: VectorSearch]: Searching for 'packaging of the coffee'...
[TOOL: VectorSearch]: Found 3 documents. Saving to Retrieval Cache.
[TOOL: VectorSearch]: Searching for 'taste of the coffee'...
[TOOL: VectorSearch]: Found 3 documents. Saving to Retrieval Cache.
[AGENT]: People have mixed opinions on the packaging and taste of the coffee.
Regarding **packaging**:
* Some customers have received products with damaged packaging, such as a "crushed box" and "coffee dust all over the K-cups."
* Others have noted issues with the clarity of information on the packaging.
Regarding the **actual taste of the coffee**:
* Several reviews describe the taste negatively, with comments like "very bitter,"
* One reviewer simply stated it "tastes like instant coffee."
[TIME TAKEN]: 7.34 seconds
Scenario 8: Predicate Caching (Time-Bounded Validation)

Finally, we can apply more advanced staleness-invalidation logic to optimize cache retrieval. Here is an example.
A user asks: “How many reviews were written in 2011?”
Since this is a global query involving a large number of rows, the table-level staleness check (Scenario 5) would apply. However, if someone adds a review for the year 2026, the entire table’s MAX(Time) changes, and the 2011 cache would be invalidated and cleared. That is not efficient.
Instead, we employ Predicate Caching. The cache entry records the specific SQL WHERE clause constraint (e.g., Time BETWEEN start_of_2011 AND end_of_2011).
When a new 2026 review is added, using the check_predicate_staleness tool, the system checks the MAX(Time) only within the 2011 slice. Seeing that the 2011 slice is undisturbed, it safely returns a Cache HIT. Only when a review specifically dated for 2011 is inserted does the predicate validation flag it as stale, ensuring highly targeted, efficient invalidation.
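A sketch of the predicate check against SQLite: the cached WHERE clause scopes the MAX(Time) scan to the slice, so inserts outside it never trigger invalidation. For brevity the predicate is interpolated directly into the SQL here; a production version would parameterize it to avoid injection.

```python
import sqlite3

def check_predicate_staleness(conn, predicate_sql: str,
                              cached_max_time: float) -> bool:
    """Return True if the cached slice is stale: only rows matching the cached
    WHERE clause are inspected, so unrelated inserts leave the entry valid.
    (Sketch: predicate_sql is trusted/interpolated for brevity.)"""
    row = conn.execute(
        f"SELECT MAX(Time) FROM reviews WHERE {predicate_sql}"
    ).fetchone()
    live_max = row[0] or 0  # MAX() is NULL when no rows match
    return live_max > cached_max_time
```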
Here is the query flow.
============================================================
= Scenario 8: Predicate Caching (Time-Bounded Validation) ==
============================================================
-> Step 1: Initial Ask (Expect MISS, Agent executes filtered SQL)
[USER]: How many reviews were written in 2011?
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[AGENT]: There were 59 reviews written in 2011.
-> Step 2: Asking again (Expect HIT - Predicate slice is fresh)
[USER]: How many reviews were written in 2011?
[SYSTEM]: Semantic Cache HIT (Fresh Predicate Marker) -> There were 59 reviews written in 2011.
-> Step 3: Adding a NEW review for a DIFFERENT year (2026)...
-> Testing Semantic Cache for 2011 AFTER an unrelated 2026 update:
-> EXPECTATION: Semantic Cache HIT (The 2011 slice is unchanged!)
[USER]: How many reviews were written in 2011?
[SYSTEM]: Semantic Cache HIT (Fresh Predicate Marker) -> There were 59 reviews written in 2011.
-> Step 4: Adding a NEW review WITHIN the 2011 time slice...
-> Testing Semantic Cache for 2011 AFTER a related 2011 update:
-> EXPECTATION: Stale cache detected (Predicate marker changed). Invalidating.
[USER]: How many reviews were written in 2011?
[SYSTEM]: Stale cache detected (Predicate 'Time >= 1293840000 AND Time <= 1325375999' marker changed). Invalidating.
[SYSTEM]: Semantic Cache MISS / BYPASSED. Routing to Agent...
[AGENT]: There were 60 reviews written in 2011.
Conclusion
In this article, we demonstrated how redundancy silently inflates latency and token spend in production RAG systems. We walked through a dual-source agentic setup combining structured SQL data and unstructured vector search, and showed how repeated queries unnecessarily trigger identical retrieval and generation pipelines.
To solve this, we introduced a validation-aware, two-tier caching architecture:
- Tier 1 (Semantic Cache) eliminates repeated LLM reasoning by serving semantically identical answers instantly.
- Tier 2 (Retrieval Cache) avoids redundant database and vector searches by reusing previously fetched context.
- Agentic validation layers—temporal bypass, row-level and table-level checks, cryptographic hashing, predicate-aware invalidation, and context sufficiency evaluation—ensure that efficiency does not come at the cost of correctness.
The result is a system that is not only faster and cheaper, but also smarter and safer.
As enterprises scale RAG, the difference between a prototype system and a production-grade one will not be model size, but architectural discipline and efficiency. Intelligent caching transforms Agentic RAG from a reactive pipeline into a self-optimizing knowledge engine.
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
Reference
Amazon Product Reviews — Dataset by Arham Rumi (Owner) (CC0: Public Domain)
Images used in this article are generated using Google Gemini. Figures and underlying code created by me.