Beyond Prompt Caching: 5 More Things You Should Cache in RAG Pipelines



In a previous post, we talked in detail about what Prompt Caching is in LLMs and how it can save you a lot of money and time when running AI-powered apps with high traffic. But beyond Prompt Caching, the concept of a cache can also be applied in several other parts of AI applications, such as caching RAG retrieval results or entire query-response pairs, providing further cost and time savings. In this post, we take a closer look at which other components of an AI app can benefit from caching mechanisms. So, let's take a look at caching in AI beyond Prompt Caching.


Why does it make sense to cache other things?

So, Prompt Caching makes sense because we expect system prompts and instructions to be passed as input to the LLM in exactly the same format every time. But beyond this, we can also expect user queries to be repeated, or to resemble one another to some extent. Especially when deploying RAG or other AI apps within an organization, we expect a large portion of the queries to be semantically similar, or even identical. Naturally, groups of users within an organization are interested in similar things most of the time, like ‘how many days of annual leave is an employee entitled to according to the HR policy’ or ‘what is the process for submitting travel expenses’. Nevertheless, it is statistically unlikely that multiple users will ask the exact same query (the exact same words, allowing for an exact match), unless we provide them with proposed, standardized queries within the UI of the app. Nonetheless, there is a very high chance that users ask queries with different words that are semantically very similar. Thus, apart from the conventional cache, it also makes sense to think of a semantic cache.

In this way, we can further distinguish between the two types of cache:

  • Exact-Match Caching, that is, caching the original text or some normalized version of it. We then get a cache hit only on exact, word-for-word matches of the text. Exact-match caching can be implemented with a key-value store like Redis.
  • Semantic Caching, that is, creating an embedding of the text. We then get a cache hit for any text that is semantically similar to it and exceeds a predefined similarity threshold (like cosine similarity above ~0.95). Since we are interested in the semantics of the texts and perform a similarity search, a vector database, such as ChromaDB, needs to be used as the cache store.
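As a rough sketch of the two lookup styles (the function names and the 0.95 threshold are illustrative, and the exact-match store is a plain dictionary standing in for something like Redis):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_lookup(cache: dict, key: str):
    # Exact-match: a hit only on identical (or identically normalized) text.
    return cache.get(key)

def semantic_lookup(cache: list, query_vec, threshold: float = 0.95):
    # cache is a list of (embedding, value) pairs; a hit is the most
    # similar cached embedding, provided it clears the threshold.
    best, best_sim = None, threshold
    for vec, value in cache:
        sim = cosine(query_vec, vec)
        if sim >= best_sim:
            best, best_sim = value, sim
    return best
```

In production, the exact-match side would typically live in Redis and the semantic side in a vector database such as ChromaDB, which performs the nearest-neighbor search for you.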

Unlike Prompt Caching, where we use a cache integrated into the LLM's API service, implementing caching in other stages of a RAG pipeline requires an external cache store, like Redis or ChromaDB mentioned above. While this is a bit of a hassle, since we need to set up those cache stores ourselves, it also gives us more control over the cache's parameters. For instance, we get to decide on our cache expiration policy, meaning how long a cached item remains valid and can be reused. This lifetime is known as the Time-To-Live (TTL) of a cached item.
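To make TTL concrete, here is a toy in-memory sketch of an expiring key-value store (with Redis, you would instead pass an expiry when setting the key, e.g. `SET key value EX 3600`):

```python
import time

class TTLCache:
    """Toy key-value cache where each entry expires after ttl seconds."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._store = {}  # key -> (value, expiry_timestamp)

    def set(self, key, value):
        self._store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            # The entry has outlived its TTL: evict it and report a miss.
            del self._store[key]
            return None
        return value
```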

As illustrated in my previous posts, a very simple RAG pipeline looks something like this:

Even in the simplest form of a RAG pipeline, we already use a caching-like mechanism without realizing it: storing the embeddings in a vector database and retrieving them from there, instead of calling an embedding model to recalculate them on every request. This is very straightforward and essentially non-negotiable even in a very simple RAG pipeline (it would be silly not to do it), because document embeddings generally remain the same (an embedding needs to be recalculated only when a document in the knowledge base changes), so it makes sense to calculate each one once and store it.

But apart from storing the knowledge base embeddings in a vector database, other parts of the RAG pipeline can also be reused, and we can benefit from applying caching to them. Let’s see what those are in more detail!

. . .

1. Query Embedding Cache

The first thing that happens in a RAG system when a query is submitted is that the query is transformed into an embedding vector, so that we can perform semantic search and retrieval against the knowledge base. Of course, this step is very lightweight compared to calculating the embeddings of the entire knowledge base. Nonetheless, in high-traffic applications it can still add unnecessary latency and cost, and in any case, recalculating the same embeddings for the same queries over and over again is wasteful.

So, instead of computing the query embedding every time from scratch, we can first check if we have already computed the embedding for the same query before. If yes, we simply reuse the cached vector. If not, we generate the embedding once, store it in the cache, and make it available for future reuse.

In this case, our RAG pipeline would look something like this:

The most straightforward way to implement query embedding caching is to look for an exact match of the raw user query. For example:

What area codes correspond to Athens, Greece?

Nevertheless, we can also use a normalized version of the raw user query, produced by simple operations like lowercasing it or stripping punctuation. In this way, the following queries…

What area codes correspond to athens greece?
What area codes correspond to Athens, Greece
what area codes correspond to Athens // Greece?

… would all map to …

what area codes correspond to athens greece?
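One possible normalizer along these lines (note that this simple version also strips the trailing question mark, a minor variation on the example above):

```python
import re
import string

def normalize(query: str) -> str:
    # Lowercase, drop all punctuation, and collapse repeated whitespace.
    query = query.lower()
    query = query.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", query).strip()
```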

We then search for this normalized query in the KV store, and on a cache hit we can directly use the embedding stored in the cache, with no need to call the embedding model again. The cached value is an embedding vector looking something like this, for example:

[0.12, -0.33, 0.88, ...]

In general, for the query embedding cache, the keys and values have the following format:

query → embedding

As you may already imagine, the hit rate of this cache can improve significantly if we offer users standardized, suggested queries in the app's UI, rather than only letting them type their own queries in free text.

. . .

2. Retrieval Cache

Caching can also be utilized at the retrieval step of a RAG pipeline. This means that we can cache the retrieved results for a specific query and minimize the need to perform a full retrieval for similar queries. In this case, the key of the cache may be the raw or normalized user query, or the query embedding. The value we get back from the cache is the set of retrieved document chunks. So, our RAG pipeline with retrieval caching, either exact-match or semantic, would look something like this:

So for our normalized query…

what area codes correspond to athens greece?

or from the query embedding…

[0.12, -0.33, 0.88, ...]

we would directly get the retrieved chunks back from the cache:

[
 chunk_12,
 chunk_98,
 chunk_42
]

In this way, when an identical or even somewhat similar query is submitted, we already have the relevant chunks and documents in the cache — there is no need to perform the retrieval step. In other words, even for queries that are only moderately similar (for example, cosine similarity above ~0.85), the exact response may not exist in the cache, but the relevant chunks and documents needed to answer the query often do.

In general, for the retrieval cache, the keys and values have the following format:

query → retrieved_chunks

One may wonder how this differs from the query embedding cache. After all, if the query is the same, why not hit the retrieval cache directly, and why keep a separate query embedding cache at all? The answer is that in practice, the query embedding cache and the retrieval cache may have different TTL policies. The documents in the knowledge base may change, so even for the same query (and thus the same query embedding), the corresponding chunks may differ. This is why the query embedding cache is useful in its own right.
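A sketch of a semantic retrieval cache keyed on query embeddings, using the ~0.85 threshold from the example above (the chunk IDs and vectors are hypothetical):

```python
import numpy as np

retrieval_cache = []  # list of (query_embedding, retrieved_chunk_ids)

def cached_retrieve(query_vec, threshold: float = 0.85):
    # Return the cached chunk IDs of the most similar past query, if any.
    best_chunks, best_sim = None, threshold
    for vec, chunks in retrieval_cache:
        sim = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim >= best_sim:
            best_chunks, best_sim = chunks, sim
    return best_chunks

def store_retrieval(query_vec, chunks):
    retrieval_cache.append((np.asarray(query_vec, dtype=float), chunks))
```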

. . .

3. Reranking Cache

Another way to utilize caching in the context of RAG is by caching the results of the reranker model (if we use one). More specifically, this means that instead of passing the retrieved ranked results to a reranker model and getting back the reranked results, we directly get the reranked order from the cache, for a specific query and retrieved chunks. In this case, our RAG pipeline would look something like this:

In our Athens area codes example, for our normalized query:

what area codes correspond to athens greece?

and hypothetical retrieved and ranked chunks

[
 chunk_12,
 chunk_98,
 chunk_42
]

we could directly get the reranked chunks as output of the cache:

[
chunk_98,
chunk_12,
chunk_42
]

In general, for the reranking cache, the keys and values have the following format:

(query + retrieved_chunks) → reranked_chunks
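Because the key combines the query with the retrieved chunk IDs, one common approach is to hash that combination into a single compact key (a sketch with the hypothetical chunk IDs from above):

```python
import hashlib

rerank_cache = {}  # composite key -> reranked chunk IDs

def rerank_key(query: str, chunk_ids: list) -> str:
    # Hash the query together with the ordered retrieved chunk IDs
    # into a single compact cache key.
    raw = query + "|" + "|".join(chunk_ids)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def cached_rerank(query: str, chunk_ids: list):
    return rerank_cache.get(rerank_key(query, chunk_ids))

def store_rerank(query: str, chunk_ids: list, reranked: list):
    rerank_cache[rerank_key(query, chunk_ids)] = reranked
```

Sorting the chunk IDs before hashing would instead make the key order-insensitive, which may or may not be desirable depending on whether the reranker is sensitive to the retrieval order.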

Again, one may wonder: if we hit the reranking cache, shouldn’t we also always hit the retrieval cache? At first glance, this might seem true, but in practice, it is not necessarily the case.

One reason is that, as explained already, different caches may have different TTL policies. Even if the reranking result is still cached, the retrieval cache may have already expired and require performing the retrieval step from scratch.

But beyond this, a complex RAG system will most likely use more than one retrieval mechanism (e.g., semantic search, BM25, etc.). As a result, we may hit the retrieval cache for one of the retrieval mechanisms but not for all of them, and thus miss the reranking cache. Conversely, we may hit the reranking cache but miss the individual caches of the various retrieval mechanisms: we may end up with the same overall set of documents while retrieving different documents from each individual mechanism. For these reasons, the retrieval and reranking caches are conceptually and practically different.

. . .

4. Prompt Assembly Cache

Another useful place to apply caching in a RAG pipeline is the prompt assembly stage. That is, once retrieval and reranking are completed, the relevant chunks are combined with the system prompt and the user query to form the final prompt that is sent as input to the LLM. So, if the query, system prompt, and reranked chunks all match, we get a cache hit. This means that we don't need to reconstruct the final prompt again; we can take parts of it (the context) or even the entire final prompt directly from the cache.

Caching the prompt assembly step in a RAG pipeline would look something like this:

Continuing with our Athens example, suppose the user submits the query…

what area codes correspond to athens greece?

and after retrieval and reranking, we get the following chunks (either from the reranker or the reranking cache):

[
chunk_98,
chunk_12,
chunk_42
]

During the prompt assembly step, these chunks are combined with the system prompt and the user query to construct the final prompt that will be sent to the LLM. For example, the assembled prompt may look something like:

System: You are a helpful assistant that answers questions using the provided context.

Context:
[chunk_98]
[chunk_12]
[chunk_42]

User: what area codes correspond to athens greece?

In general, for the prompt assembly cache, the keys and values have the following format:

(query + system_prompt + retrieved_chunks) → assembled_prompt

Admittedly, the computational savings here are smaller than for the other caching layers mentioned above. Nonetheless, prompt assembly caching can still reduce latency and simplify prompt construction in high-traffic systems. It makes the most sense in systems where prompt assembly is complex and involves more operations than a simple concatenation, like inserting guardrails.
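A sketch of the assembly step with a cache in front of it, following the prompt template shown above (the cache key covers everything the prompt depends on):

```python
prompt_cache = {}  # (system_prompt, query, chunks) -> assembled prompt

def assemble_prompt(system_prompt: str, query: str, chunks: list) -> str:
    key = (system_prompt, query, tuple(chunks))
    if key in prompt_cache:          # cache hit: reuse the stored prompt
        return prompt_cache[key]
    # Cache miss: build the prompt. A real assembler might also insert
    # guardrails, formatting rules, or few-shot examples here.
    context = "\n".join(f"[{c}]" for c in chunks)
    prompt = (
        f"System: {system_prompt}\n\n"
        f"Context:\n{context}\n\n"
        f"User: {query}"
    )
    prompt_cache[key] = prompt
    return prompt
```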

. . .

5. Query-Response Caching

Last but not least, we can cache entire query and response pairs. Intuitively, this is the first thing that comes to mind when we talk about caching. And it is the ultimate jackpot for our RAG pipeline: in this case, we don't need to run any of it, and we can answer the user's query solely from the cache.

More specifically, we store entire query and final-response pairs in the cache, completely avoiding retrieval (in the case of RAG) and regeneration of a response. Instead of retrieving relevant chunks and generating a response from scratch, we directly return a precomputed response that was generated at some earlier time for the same or a very similar query.

To safely implement query-response caching, we either have to use exact matching in the form of a key-value cache, or semantic caching with a very strict threshold (like a cosine similarity of at least 0.99 between the user query and the cached query).
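A sketch of the strict semantic variant, reusing a cached response only when the new query's embedding is nearly identical to a cached one (toy vectors, 0.99 threshold as above):

```python
import numpy as np

response_cache = []  # list of (query_embedding, final_response)

def cached_response(query_vec, threshold: float = 0.99):
    # Only reuse a full response for near-identical queries; anything
    # less similar falls through to the full RAG pipeline.
    for vec, response in response_cache:
        sim = float(np.dot(query_vec, vec) /
                    (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
        if sim >= threshold:
            return response
    return None

def store_response(query_vec, response):
    response_cache.append((np.asarray(query_vec, dtype=float), response))
```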

So, our RAG pipeline with query-response caching would look something like this:

Continuing with our Athens example, suppose a user asks the query:

what area codes correspond to athens greece?

Assume that earlier, the system already processed this query through the full RAG pipeline, retrieving relevant chunks, reranking them, assembling the prompt, and generating the final answer with the LLM. The generated response might look something like:

The main telephone area code for Athens, Greece is 21. 
Numbers in the Athens metropolitan area typically start with the prefix 210, 
followed by the local subscriber number.

The next time an identical or extremely similar query appears, the system does not need to run the retrieval, reranking, or generation steps again. Instead, it can immediately return the cached response.

In general, for the query-response cache, the keys and values have the following format:

query → final_response

. . .

On my mind

Apart from the Prompt Caching provided directly in the API services of the various LLMs, several other caching mechanisms can be utilized in a RAG application to achieve cost and latency savings. More specifically, we can use a query embedding cache, a retrieval cache, a reranking cache, a prompt assembly cache, and a query-response cache. In practice, a real-world RAG system can combine many or all of these cache stores to improve performance in terms of cost and time as the app's user base scales.



All images by the author, except mentioned otherwise.
