In my latest post, I explained how hybrid search can be used to significantly improve the effectiveness of a RAG pipeline. RAG, in its basic version, using just semantic search on embeddings, can be very effective, allowing us to bring the power of AI to our own documents. Nonetheless, semantic search, as powerful as it is, can sometimes miss exact matches of the user’s query in large knowledge bases, even when they exist in the documents. This weakness of traditional RAG can be addressed by adding a keyword search component, like BM25, to the pipeline. In this way, hybrid search, combining semantic and keyword search, leads to much more comprehensive results and significantly improves the performance of a RAG system.
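A common way to combine the two result lists of a hybrid search is reciprocal rank fusion (RRF). As a minimal sketch (the document ids and the two rankings are illustrative, and a real pipeline would get them from an embedding index and a BM25 index respectively):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    rankings: list of lists of ids, each ordered best-first.
    k: damping constant; 60 is the conventional choice from the RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents that rank high in any list accumulate more score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from embedding search
keyword_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from BM25
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Documents that appear near the top of both lists (here `doc_a`) bubble up, while documents found by only one method still make it into the fused ranking.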
Be that as it may, even when using RAG with hybrid search, we can still sometimes miss important information that is scattered across different parts of a document. This can happen because, when a document is broken down into text chunks, the context (the surrounding text of the chunk that forms part of its meaning) is sometimes lost. This is especially likely for complex text, with meaning that is interconnected and spread across several pages, and inevitably cannot be wholly included within a single chunk. Think, for example, of a table or an image referenced across several different text sections without explicitly stating which table is meant (e.g., “as shown in the Table, profits increased by 6%”: which table?). As a result, when the text chunks are retrieved, they arrive stripped of their context, sometimes resulting in the retrieval of irrelevant chunks and the generation of irrelevant responses.
This loss of context has long been a major issue for RAG systems, and several not-so-successful solutions have been explored for mitigating it. An obvious attempt is increasing chunk size, but this often dilutes the semantic meaning of each chunk and ends up making retrieval less precise. Another approach is increasing chunk overlap. While this helps preserve context, it also increases storage and computation costs. Most importantly, it doesn’t fully solve the problem: important interconnections can still fall outside chunk boundaries. More advanced approaches to this challenge include Hypothetical Document Embeddings (HyDE) and the Document Summary Index. Nonetheless, these still fail to provide substantial improvements.
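To make the overlap trade-off concrete, here is a minimal sliding-window chunker (the function name and parameters are illustrative). Every extra token of overlap is stored and embedded twice, which is where the added cost comes from:

```python
def chunk_with_overlap(tokens, chunk_size=100, overlap=20):
    """Split a token list into fixed-size chunks whose edges overlap.

    Each chunk repeats the last `overlap` tokens of its predecessor,
    so some context survives the split, at the cost of duplication.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end
    return chunks

chunks = chunk_with_overlap(list(range(10)), chunk_size=4, overlap=2)
```

With 10 tokens, a chunk size of 4, and an overlap of 2, we store 16 tokens across 4 chunks instead of 10 across 3, and a cross-boundary reference can still sit further away than any affordable overlap reaches.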
Ultimately, an approach that effectively resolves this and significantly enhances the outcomes of a RAG system is contextual retrieval, originally introduced by Anthropic in 2024. Contextual retrieval aims to resolve the loss of context by preserving the context of the chunks and, therefore, improving the accuracy of the retrieval step of the RAG pipeline.
. . .
What about context?
Before saying anything about contextual retrieval, let’s take a step back and talk a little bit about what context is. Sure, we’ve all heard about the context of LLMs or context windows, but what are those about, really?
To be very precise, context refers to all the tokens that are available to the LLM and based on which it predicts the next token — remember, LLMs work by generating text one token at a time. Thus, that will be the user prompt, the system prompt, instructions, skills, or any other guidelines influencing how the model produces a response. Importantly, the part of the final response the model has produced so far is also part of the context, since each new token is generated based on everything that came before it.
Naturally, different contexts lead to very different model outputs. For example:
- ‘I went to a restaurant and ordered a’ could output ‘pizza’.
- ‘I went to the pharmacy and bought a’ could output ‘medicine’.
A fundamental limitation of LLMs is their context window. The context window of an LLM is the maximum number of tokens that can be passed at once as input to the model and be taken into account to produce a single response. There are LLMs with larger or smaller context windows. Modern frontier models can handle hundreds of thousands of tokens in a single request, whereas earlier models often had context windows as small as 8k tokens.
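A quick back-of-the-envelope check of whether some text fits a context window can be done with a rough character-based token estimate. The heuristic of roughly 4 characters per token for English text is a common rule of thumb, not an exact tokenizer, and the function names here are illustrative:

```python
def rough_token_count(text):
    """Crude estimate: English text averages roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fits_context(texts, window=200_000, reserve_for_output=4_000):
    """Check whether all texts fit the window, leaving room for the reply."""
    used = sum(rough_token_count(t) for t in texts)
    return used <= window - reserve_for_output
```

A real system would use the provider’s tokenizer for exact counts, but even this crude estimate shows why a multi-thousand-page knowledge base cannot simply be pasted into a single request.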
In a perfect world, we would want to just pass all the information that the LLM needs to know in the context, and we would most likely get very good answers. And this is true to some extent — a frontier model like Opus 4.6 with a 200k token context window corresponds to about 500-600 pages of text. If all the information we need to provide fits this size limit, we can indeed just include everything as is, as an input to the LLM and get a great answer.
The issue is that most real-world AI use cases need some kind of knowledge base with a size far beyond this threshold — think, for instance, of legal libraries or manuals for technical equipment. Since models have these context window limitations, we unfortunately cannot just pass everything to the LLM and let it magically respond — we have to somehow pick the most important information to include in our limited context window. And that is essentially what the RAG methodology is all about — picking the appropriate information from a large knowledge base so as to effectively answer a user’s query. Ultimately, this emerges as an optimization/engineering problem — context engineering — identifying the appropriate information to include in a limited context window, so as to produce the best possible responses.
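The retrieval step that picks this information is, at its core, a nearest-neighbor search over chunk embeddings. A minimal sketch with pure-Python cosine similarity (the toy 2-dimensional embeddings are illustrative; real embeddings have hundreds or thousands of dimensions and come from an embedding model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_emb, chunk_embs, k=2):
    """Return indices of the k chunks most similar to the query."""
    order = sorted(range(len(chunk_embs)),
                   key=lambda i: cosine(query_emb, chunk_embs[i]),
                   reverse=True)
    return order[:k]

query = [1.0, 0.0]
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k(query, chunk_embeddings, k=2)
```

Only the chunks whose indices come back from `top_k` make it into the limited context window; everything else in the knowledge base is left out of that request.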
This is the most crucial part of a RAG system — making sure the appropriate information is retrieved and passed over as input to the LLM. This can be done with semantic search and keyword search, as already explained. Nevertheless, even when bringing all semantically relevant chunks and all exact matches, there’s still a good chance that some important information may be left behind.
But what kind of information would this be? Since we have covered the meaning with semantic search and the exact matches with keyword search, what other type of information is there to consider?
Different documents with inherently different meanings may include parts that are similar or even identical. Imagine a recipe book and a chemical processing manual both instructing the reader to ‘Heat the mixture slowly’. The semantic meaning of such a text chunk and the actual words are practically identical. In this example, what forms the meaning of the text and allows us to distinguish between cooking and chemical engineering is what we are referring to as context.

Thus, this is the kind of extra information we aim to preserve. And this is exactly what contextual retrieval does: preserves the context — the surrounding meaning — of each text chunk.
. . .
What about contextual retrieval?
So, contextual retrieval is a methodology applied in RAG aiming to preserve the context of each chunk. In this way, when a chunk is retrieved and passed over to the LLM as input, we are able to preserve as much of its initial meaning as possible — the semantics, the keywords, the context — all of it.
To achieve this, contextual retrieval suggests that we first generate a helper text for each chunk — namely, the contextual text — that allows us to situate the text chunk in the original document it comes from. In practice, we ask an LLM to generate this contextual text for each chunk. To do this, we provide the document, along with the actual chunk, in a single request to an LLM and prompt it to “provide the context to situate the specific chunk in the document“. A prompt for generating the contextual text for our Italian Cookbook chunk would look something like this:
<document>
the entire Italian Cookbook document the chunk comes from
</document>
Here is the chunk we want to place within the context of the full document.
<chunk>
the actual chunk
</chunk>
Provide a brief context that situates this chunk within the overall
document to improve search retrieval. Respond only with the concise
context and nothing else.
The LLM returns the contextual text, which we combine with our initial text chunk. In this way, for each chunk of our initial text, we generate a contextual text that describes how this specific chunk is placed in its parent document. For our example, this would be something like:
Context: Recipe step for simmering homemade tomato pasta sauce.
Chunk: Heat the mixture slowly and stir occasionally to prevent it from sticking.
Which is indeed a lot more informative and specific! Now there is no doubt about what this mysterious mixture is, because all the information needed for identifying whether we are talking about tomato sauce or laboratory starch solutions is conveniently included within the same chunk.
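Putting the pieces together, the contextualization step might look like the following sketch. The `llm_generate` callable is a hypothetical stand-in for whatever model client you use; `fake_llm` below only exists to make the example self-contained:

```python
def build_context_prompt(document, chunk):
    """Assemble the contextual-retrieval prompt for one chunk."""
    return (
        f"<document>\n{document}\n</document>\n"
        "Here is the chunk we want to place within the context of the full document.\n"
        f"<chunk>\n{chunk}\n</chunk>\n"
        "Provide a brief context that situates this chunk within the overall "
        "document to improve search retrieval. Respond only with the concise "
        "context and nothing else."
    )

def contextualize(document, chunks, llm_generate):
    """Prefix each chunk with its LLM-generated contextual text.

    llm_generate: any callable taking a prompt string and returning the
    model's text response (swap in your real client here).
    """
    contextualized = []
    for chunk in chunks:
        context = llm_generate(build_context_prompt(document, chunk))
        contextualized.append(f"Context: {context}\nChunk: {chunk}")
    return contextualized

def fake_llm(prompt):  # stand-in for a real model call
    return "Recipe step for simmering homemade tomato pasta sauce."

pairs = contextualize(
    "…the full Italian Cookbook text…",
    ["Heat the mixture slowly and stir occasionally."],
    fake_llm,
)
```

The strings in `pairs` are exactly what gets embedded and indexed in the next step, so the context travels with the chunk from here on.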
From this point on, we treat the initial chunk text and the contextual text as an unbreakable pair. The rest of the steps of RAG with hybrid search are then performed essentially in the same way. That is, for each text chunk prepended with its contextual text, we create embeddings that are stored in a vector database, and we add it to the BM25 index.

This approach, as simple as it is, results in astonishing improvements in the retrieval performance of RAG pipelines. According to Anthropic’s own evaluations, contextual embeddings reduce the top-20-chunk retrieval failure rate by an impressive 35%, and by 49% when combined with contextual BM25.
. . .
Reducing cost with prompt caching
I hear you asking, “But isn’t this going to cost a fortune?” Surprisingly, no.
Intuitively, we understand that this setup is going to significantly increase the cost of ingestion for a RAG pipeline — essentially double it, if not more. After all, we have now added a bunch of extra calls to the LLM, haven’t we? This is true to some extent — for each chunk, we now make an additional call to the LLM in order to situate it within its source document and get the contextual text.
However, this is a cost that we pay only once, at the stage of document ingestion. Unlike alternative techniques that attempt to preserve context at runtime — such as Hypothetical Document Embeddings (HyDE) — contextual retrieval performs the heavy work during the document ingestion stage. In runtime approaches, additional LLM calls are required for every user query, which can quickly scale latency and operational costs. In contrast, contextual retrieval shifts the computation to the ingestion phase, meaning that the improved retrieval quality comes with no additional overhead during runtime. On top of this, prompt caching can further reduce the ingestion cost: the full document is written to the model provider’s prompt cache once, and every subsequent per-chunk call reads the document tokens back from the cache at a fraction of the price, instead of reprocessing the whole document for each chunk.
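As a sketch of how this could look with Anthropic-style prompt caching: the document goes into the system prompt with a `cache_control` marker, so the first per-chunk request writes it to the cache and later requests reuse it. This only constructs the request payload (the field names follow Anthropic’s prompt-caching documentation; the model name and token limit are illustrative, and no API call is made here):

```python
def build_cached_request(document, chunk, model="claude-sonnet-4-5"):
    """Build a Messages-API-style payload with a cacheable document prefix."""
    return {
        "model": model,
        "max_tokens": 200,
        "system": [
            {
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                # Cache marker: the document prefix is written to the cache
                # on the first call and read back cheaply on later calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": (
                    "Here is the chunk we want to place within the context "
                    f"of the full document.\n<chunk>\n{chunk}\n</chunk>\n"
                    "Provide a brief context that situates this chunk within "
                    "the overall document to improve search retrieval. "
                    "Respond only with the concise context and nothing else."
                ),
            }
        ],
    }
```

Because only the short per-chunk suffix changes between calls, the expensive document tokens are billed at the much lower cache-read rate after the first request.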
. . .
On my mind
Contextual retrieval represents a simple yet powerful improvement to traditional RAG systems. By enriching each chunk with contextual text, pinpointing its semantic position within its source document, we dramatically reduce the ambiguity of each chunk, and thus improve the quality of the information passed to the LLM. Combined with hybrid search, this technique allows us to preserve semantics, keywords, and context simultaneously.
All images by the author, except mentioned otherwise.