OpenAI Prompt Cache Monitoring | by Thomas Reid | Dec, 2024



A worked example using Python and the chat completion API

As part of their recent DevDay presentation, OpenAI announced that Prompt Caching is now available for various models. At the time of writing, those models were:

GPT-4o, GPT-4o mini, o1-preview and o1-mini, as well as fine-tuned versions of those models.

This news shouldn’t be underestimated, as it allows developers to save on costs and reduce application latency.

API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens. The API caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. If you reuse prompts with common prefixes, OpenAI will automatically apply the Prompt Caching discount without requiring you to change your API integration.
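The prefix arithmetic described above can be sketched as a small helper. The function name and logic are mine, written only to illustrate the 1,024-token threshold and 128-token increments; it is not part of the OpenAI SDK.

```python
def cacheable_prefix(prompt_tokens: int) -> int:
    """Illustrative sketch: the longest prompt prefix eligible for caching,
    given the documented 1,024-token minimum and 128-token increments."""
    if prompt_tokens < 1024:
        return 0  # prompts below the threshold are never cached
    # Round down to the nearest 1024 + k * 128 boundary.
    return 1024 + ((prompt_tokens - 1024) // 128) * 128

# A 1,500-token prompt can have at most its first 1,408 tokens cached:
print(cacheable_prefix(1500))  # → 1408
print(cacheable_prefix(1000))  # → 0
```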

As an OpenAI API developer, the only thing you may need to worry about is how to monitor your Prompt Caching use, i.e. checking that the discount is actually being applied.

In this article, I’ll show you how to do that using Python, a Jupyter Notebook and a chat completion example.
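The key signal is the `usage` field returned with each chat completion: it includes a `prompt_tokens_details.cached_tokens` count. Below is a minimal sketch of a helper that summarises cache hits from that payload. The helper function and the sample numbers are mine; the field names match the API's usage payload, and in a real call you would pass in `response.usage` from the chat completion.

```python
def cache_stats(usage: dict) -> dict:
    """Summarise prompt-cache usage from a chat-completion usage payload.
    Hypothetical helper; field names follow the API's usage object."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    pct = round(100 * cached / prompt_tokens, 1) if prompt_tokens else 0.0
    return {
        "prompt_tokens": prompt_tokens,
        "cached_tokens": cached,
        "cached_pct": pct,
    }

# Hypothetical usage payload, shaped like response.usage:
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {"cached_tokens": 1024},
}

print(cache_stats(sample_usage))
# → {'prompt_tokens': 2048, 'cached_tokens': 1024, 'cached_pct': 50.0}
```

With the `openai` Python SDK (v1.x), you would obtain the payload from a real call with something like `cache_stats(response.usage.model_dump())`; a nonzero `cached_tokens` confirms the discount was applied.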

