Why Care About Prompt Caching in LLMs?

Editor


In previous posts, we’ve talked a lot about what an incredible tool RAG is for leveraging the power of AI on custom data. But whether we are talking about plain LLM API requests, RAG applications, or more complex AI agents, one common question remains the same: how do all these things scale? In particular, what happens to cost and latency as the number of requests grows? These questions become especially important for more advanced AI agents, which may make multiple calls to an LLM to process a single user query.

Fortunately, when making calls to an LLM, the same input tokens are usually repeated across multiple requests. Users ask some specific questions much more often than others, system prompts and instructions integrated into AI-powered applications are repeated in every user query, and even within a single prompt, models perform recursive calculations to generate an entire response (remember how LLMs produce text by predicting tokens one by one?). As in other applications, applying the caching concept can significantly optimize LLM request costs and latency. For instance, according to the OpenAI documentation, Prompt Caching can reduce latency by up to an impressive 80% and input token costs by up to 90%.


What about caching?

In general, caching in computing is no new idea. At its core, a cache is a component that stores data temporarily so that future requests for the same data can be served faster. In this way, we can distinguish between two basic cache states – a cache hit and a cache miss. In particular:

  • A cache hit occurs when the requested data is found in the cache, allowing for a quick and cheap retrieval.
  • A cache miss occurs when the data is not in the cache, forcing the application to access the original source, which is more expensive and time-consuming.
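The hit/miss logic above can be sketched in a few lines of Python. This is a toy illustration; `slow_lookup` is a hypothetical stand-in for the expensive primary source:

```python
cache = {}
stats = {"hits": 0, "misses": 0}

def slow_lookup(key):
    # Placeholder for an expensive operation (e.g. a network request).
    return f"data-for-{key}"

def get(key):
    if key in cache:                # cache hit: serve from memory
        stats["hits"] += 1
        return cache[key]
    stats["misses"] += 1            # cache miss: go to the original source
    value = slow_lookup(key)
    cache[key] = value              # store for future requests
    return value

get("page.html")   # first request: a miss, fetched from the source
get("page.html")   # second request: a hit, served from the cache
```

After the two calls, the stats show exactly one miss followed by one hit for the same key.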

One of the most typical implementations of a cache is in web browsers. When visiting a website for the first time, the browser checks for the URL in its cache memory, but finds nothing (a cache miss). Since the data we are looking for isn’t locally available, the browser has to perform a more expensive and time-consuming request across the internet, in order to fetch the data from the remote server where it originally exists. Once the page finally loads, the browser typically copies that data into its local cache. If we try to reload the same page 5 minutes later, the browser will look for it in its local storage. This time, it will find it (a cache hit) and load it from there, without reaching back to the server. This makes the browser work more quickly and consume fewer resources.

As you may imagine, caching is particularly useful in systems where the same data is requested multiple times. In most systems, data access is rarely uniform, but rather tends to follow a distribution where a small fraction of the data accounts for the vast majority of requests. A large portion of real-life applications follows the Pareto principle, meaning that about 80% of the requests target roughly 20% of the data. Without this skew, a cache would need to be nearly as large as the system’s primary memory to be effective, which would make it prohibitively expensive.
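The effect of this skew on cache effectiveness can be shown with a quick simulation. The 80/20 split, the item counts, and the cache sized to hold only the “hot” subset are all assumptions chosen for illustration:

```python
import random

random.seed(0)

# 1,000 items total, but accesses are concentrated on a small "hot" subset,
# roughly mimicking the 80/20 access pattern described above.
hot = list(range(20))            # the hot 2% of items
cold = list(range(20, 1000))     # the remaining 98%
cache = set(hot)                 # a cache big enough for the hot set only

hits = 0
N = 10_000
for _ in range(N):
    # 80% of requests go to the hot items, 20% to everything else.
    key = random.choice(hot) if random.random() < 0.8 else random.choice(cold)
    if key in cache:
        hits += 1

print(f"hit rate: {hits / N:.0%}")
```

Even though the cache holds just 2% of the data, it serves roughly 80% of the requests.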


Prompt Caching and a Little Bit about LLM Inference

The caching concept – storing frequently used data somewhere and retrieving it from there, instead of obtaining it again from its primary source – is applied in a similar manner to improve the efficiency of LLM calls, allowing for significantly reduced costs and latency. Caching can be applied to various elements of an AI application, the most important of which is Prompt Caching. It can also provide great benefits in other aspects of an AI app, such as caching in RAG retrieval or query–response caching. Nonetheless, this post is going to focus solely on Prompt Caching.


To understand how Prompt Caching works, we must first understand a little bit about how LLM inference – using a trained LLM to generate text – functions. LLM inference is not a single continuous process, but is rather divided into two distinct stages. Those are:

  • Pre-fill, which refers to processing the entire prompt at once to produce the first token. This stage requires heavy computation and is thus compute-bound. A very simplified picture of this stage is every token attending to every previous token, something like comparing each token with all the tokens before it.
  • Decoding, which appends the last generated token back into the sequence and generates the next one auto-regressively. This stage is memory-bound, as the system must load the entire context of previous tokens from memory to generate every single new token.

For example, imagine we have the following prompt:

What should I cook for dinner? 

From which we may then get the first token:

Here

and the following decoding iterations:

Here 
Here are 
Here are 5 
Here are 5 easy 
Here are 5 easy dinner 
Here are 5 easy dinner ideas

The issue is that, in order to generate the complete response, the model would have to process the same previous tokens over and over again to produce each next word during the decoding stage, which, as you may imagine, is highly inefficient. In our example, the model would reprocess the tokens ‘What should I cook for dinner? Here are 5 easy’ to produce the output ‘ideas’, even though it had already processed the tokens ‘What should I cook for dinner? Here are 5’ just milliseconds earlier.

To solve this, KV (Key-Value) Caching is used in LLMs. This means that the intermediate Key and Value tensors for the input prompt and previously generated tokens are calculated once and then stored in the KV cache, instead of being recomputed from scratch at each iteration. This results in the model performing the minimum needed calculations for producing each response. In other words, for each decoding iteration, the model only performs calculations to predict the newest token and then appends it to the KV cache.
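A back-of-the-envelope way to see the saving is to count token-level “computations” under each strategy. This is a toy cost model, not real attention arithmetic:

```python
def naive_decode_cost(prompt_len, new_tokens):
    # Without a KV cache, every decoding step reprocesses
    # the entire sequence up to that point.
    cost = 0
    for step in range(new_tokens):
        cost += prompt_len + step + 1   # full sequence length at this step
    return cost

def kv_cached_decode_cost(prompt_len, new_tokens):
    # With a KV cache, the prompt is processed once (pre-fill),
    # then each step only computes for the single newest token.
    return prompt_len + new_tokens

# 7 prompt tokens, 6 generated tokens (as in the cooking example above):
print(naive_decode_cost(7, 6))       # 63 token computations
print(kv_cached_decode_cost(7, 6))   # 13 token computations
```

The gap widens quadratically with response length, which is why KV caching is standard in every serious inference stack.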

Nonetheless, KV caching only works within a single prompt and a single response. Prompt Caching extends the principles of KV caching to reuse computation across different prompts, users, and sessions.


In practice, with prompt caching, we save the repeated parts of a prompt the first time they are processed. These repeated parts usually take the form of large prefixes, like system prompts, instructions, or retrieved context. In this way, when a new request contains the same prefix, the model reuses the computations made previously instead of recalculating from scratch. This is incredibly convenient, since it can significantly reduce the operating costs of an AI application (we don’t pay full price for repeated input tokens), as well as latency (we don’t wait for the model to reprocess tokens that have already been processed). This is especially useful in applications where prompts contain large repeated instructions or context, such as RAG pipelines.

It is important to understand that this caching operates at the token level. In practice, this means that even if two prompts differ at the end, as long as they share the same token prefix, the cached computations for that shared portion can still be reused, with new calculations performed only for the tokens that differ. The tricky part is that the common tokens have to be at the start of the prompt, so how we structure our prompts and instructions becomes particularly important. In our cooking example, we can imagine the following consecutive prompts.

Prompt 1
What should I cook for dinner? 

and then if we input the prompt:

Prompt 2
What should I cook for lunch? 

The shared tokens ‘What should I cook’ should be a cache hit, so one should expect to be billed for significantly fewer input tokens for Prompt 2.

Nonetheless, if we had the following prompts…

Prompt 1
Dinner time! What should I cook? 

and then

Prompt 2
Lunch time! What should I cook? 

This would be a cache miss: the prompts differ from the very first token, so there is no shared prefix to reuse, even though their semantics are essentially the same.

As a result, a basic rule of thumb for getting prompt caching to work is to always place any static information, like instructions or system prompts, at the start of the model input. On the flip side, any typically variable information, like timestamps or user identifiers, should go at the end of the prompt.
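This rule of thumb boils down to maximizing the shared token prefix between consecutive prompts. A minimal sketch, using naive whitespace “tokens” (real tokenizers split text differently, but the prefix logic is the same):

```python
def common_prefix_len(tokens_a, tokens_b):
    # Count how many tokens match from the start before the first difference.
    n = 0
    for a, b in zip(tokens_a, tokens_b):
        if a != b:
            break
        n += 1
    return n

# Variable part at the end: a long reusable prefix.
good_1 = "What should I cook for dinner?".split()
good_2 = "What should I cook for lunch?".split()

# Variable part at the start: nothing to reuse.
bad_1 = "Dinner time! What should I cook?".split()
bad_2 = "Lunch time! What should I cook?".split()

print(common_prefix_len(good_1, good_2))  # 5 tokens reusable
print(common_prefix_len(bad_1, bad_2))    # 0 tokens reusable
```

Semantically the two pairs are nearly identical, yet only the first ordering benefits from prefix caching.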


Getting our hands dirty with the OpenAI API

Nowadays, most of the frontier foundation models, like GPT or Claude, provide some kind of Prompt Caching functionality directly integrated into their APIs. More specifically, in these APIs, Prompt Caching is shared among all users of an organization accessing the same API key. In other words, once a user makes a request and its prefix is stored in the cache, any other user inputting a prompt with the same prefix gets a cache hit. That is, we get to reuse precomputed calculations, which significantly reduces token consumption and makes response generation faster. This is particularly useful when deploying AI applications in the enterprise, where many users are expected to use the same application, and thus the same input prefixes.

On most recent models, Prompt Caching is automatically activated by default, but some level of parametrization is available. We can distinguish between:

  • In-memory prompt cache retention, where cached prefixes are typically maintained for around 5–10 minutes of inactivity and up to 1 hour, and
  • Extended prompt cache retention (only available for specific models), allowing for a longer retention of the cached prefix, up to a maximum of 24 hours.
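Conceptually, both retention modes behave like a TTL (time-to-live) cache: entries expire after a retention window. A toy sketch of the idea; the class and the values are illustrative, not how providers implement retention internally:

```python
import time

class TTLCache:
    """A minimal cache whose entries expire after a fixed time-to-live."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}   # key -> (value, expiry_time)

    def put(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if time.time() > expiry:    # retention window elapsed
            del self.store[key]
            return None
        return value

cache = TTLCache(ttl_seconds=300)   # e.g. ~5 minutes of in-memory retention
cache.put("prefix-hash", "precomputed prefix state")
print(cache.get("prefix-hash"))     # hit, while the entry is still fresh
```

Extended retention simply corresponds to a much larger `ttl_seconds` in this picture, at the cost of keeping the cached state around longer.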

But let’s take a closer look!

We can see all this in practice with the following minimal Python example, which makes requests to the OpenAI API using Prompt Caching and the cooking prompts mentioned earlier. I added a rather large shared prefix to my prompts to make the effects of caching more visible:

from openai import OpenAI
api_key = "your_api_key"
client = OpenAI(api_key=api_key)

prefix = """
You are a helpful cooking assistant.

Your task is to suggest simple, practical dinner ideas for busy people.
Follow these guidelines carefully when generating suggestions:

General cooking rules:
- Meals should take less than 30 minutes to prepare.
- Ingredients should be easy to find in a regular supermarket.
- Recipes should avoid overly complex techniques.
- Prefer balanced meals including vegetables, protein, and carbohydrates.

Formatting rules:
- Always return a numbered list.
- Provide 5 suggestions.
- Each suggestion should include a short explanation.

Ingredient guidelines:
- Prefer seasonal vegetables.
- Avoid exotic ingredients.
- Assume the user has basic pantry staples such as olive oil, salt, pepper, garlic, onions, and pasta.

Cooking philosophy:
- Favor simple home cooking.
- Avoid restaurant-level complexity.
- Focus on meals that people realistically cook on weeknights.

Example meal styles:
- pasta dishes
- rice bowls
- stir fry
- roasted vegetables with protein
- simple soups
- wraps and sandwiches
- sheet pan meals

Diet considerations:
- Default to healthy meals.
- Avoid deep frying.
- Prefer balanced macronutrients.

Additional instructions:
- Keep explanations concise.
- Avoid repeating the same ingredients in every suggestion.
- Provide variety across the meal suggestions.

""" * 80   
# huge prefix to make sure we exceed the ~1,024-token threshold for activating prompt caching

prompt1 = prefix + "What should I cook for dinner?"

response1 = client.responses.create(
    model="gpt-5.2",
    input=prompt1
)

print("Response 1:")
print(response1.output_text)

print("\nUsage stats:")
print(response1.usage)

and then for Prompt 2:

prompt2 = prefix + "What should I cook for lunch?"

response2 = client.responses.create(
    model="gpt-5.2",
    input=prompt2
)

print("\nResponse 2:")
print(response2.output_text)

print("\nUsage stats:")
print(response2.usage)

So, for Prompt 2, we would only be billed for the remaining, non-identical part of the prompt: the input tokens minus the cached tokens, i.e. 20,014 − 19,840 = only 174 tokens, or in other words, about 99% fewer input tokens.
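The arithmetic above can be reproduced directly from the usage stats. The `input_tokens_details.cached_tokens` field name reflects the OpenAI Responses API at the time of writing; treat it as an assumption if your SDK version differs:

```python
# Numbers from the example run above; in practice you would read them from
# response2.usage (e.g. usage.input_tokens and
# usage.input_tokens_details.cached_tokens -- field names may vary by SDK).
input_tokens = 20_014
cached_tokens = 19_840

billed_tokens = input_tokens - cached_tokens
savings = cached_tokens / input_tokens

print(billed_tokens)               # 174
print(f"savings: {savings:.0%}")   # savings: 99%
```

Note that cached input tokens are not necessarily free, just heavily discounted; the exact discount depends on the provider and model.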

In any case, since OpenAI imposes a 1,024-token minimum threshold for activating prompt caching, and the cache is preserved for at most 24 hours, it becomes clear that these cost benefits are obtained in practice only when running AI applications at scale, with many active users performing many requests daily. For such cases, though, as explained, the Prompt Caching feature can provide substantial cost and time benefits for LLM-powered applications.


On my mind

Prompt Caching is a powerful optimization for LLMs that can significantly improve the efficiency of AI applications both in terms of cost and time. By reusing previous computations for identical prompt prefixes, the model can skip redundant calculations and avoid repeatedly processing the same input tokens. The result is faster responses and lower costs, especially in applications where large parts of prompts—such as system instructions or retrieved context—remain constant across many requests. As AI systems scale and the number of LLM calls increases, these optimizations become increasingly important.


Loved this post? Let’s be friends! Join me on:

📰Substack 💌 Medium 💼LinkedIn Buy me a coffee!

All images by the author, unless mentioned otherwise.
