OpenAI Prompt Cache Monitoring | by Thomas Reid | Dec, 2024



A worked example using Python and the chat completion API

As part of their recent DevDay presentation, OpenAI announced that Prompt Caching is now available for various models. At the time of writing, those models were:

GPT-4o, GPT-4o mini, o1-preview and o1-mini, as well as fine-tuned versions of those models.

This news shouldn’t be underestimated, as it allows developers to save on costs and reduce application latency.

API calls to supported models will automatically benefit from Prompt Caching on prompts longer than 1,024 tokens. The API caches the longest prefix of a prompt that has been previously computed, starting at 1,024 tokens and increasing in 128-token increments. If you reuse prompts with common prefixes, OpenAI will automatically apply the Prompt Caching discount without requiring you to change your API integration.
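The prefix arithmetic described above can be sketched as a small helper. The function name and logic are mine, written only to illustrate the 1,024-token threshold and 128-token increments; it is not part of the OpenAI SDK.

```python
def cacheable_prefix(prompt_tokens: int) -> int:
    """Illustrative sketch: the longest prompt prefix eligible for caching,
    given the documented 1,024-token minimum and 128-token increments."""
    if prompt_tokens < 1024:
        return 0  # prompts below the threshold are never cached
    # Round down to the nearest 1024 + k * 128 boundary.
    return 1024 + ((prompt_tokens - 1024) // 128) * 128

# A 1,500-token prompt can have at most its first 1,408 tokens cached:
print(cacheable_prefix(1500))  # → 1408
print(cacheable_prefix(1000))  # → 0
```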

As an OpenAI API developer, the only thing you may need to worry about is how to monitor your Prompt Caching use, i.e. checking that the discount is actually being applied.

In this article, I’ll show you how to do that using Python, a Jupyter Notebook and a chat completion example.
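The key signal is the `usage` field returned with each chat completion: it includes a `prompt_tokens_details.cached_tokens` count. Below is a minimal sketch of a helper that summarises cache hits from that payload. The helper function and the sample numbers are mine; the field names match the API's usage payload, and in a real call you would pass in `response.usage` from the chat completion.

```python
def cache_stats(usage: dict) -> dict:
    """Summarise prompt-cache usage from a chat-completion usage payload.
    Hypothetical helper; field names follow the API's usage object."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
    pct = round(100 * cached / prompt_tokens, 1) if prompt_tokens else 0.0
    return {
        "prompt_tokens": prompt_tokens,
        "cached_tokens": cached,
        "cached_pct": pct,
    }

# Hypothetical usage payload, shaped like response.usage:
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 150,
    "total_tokens": 2198,
    "prompt_tokens_details": {"cached_tokens": 1024},
}

print(cache_stats(sample_usage))
# → {'prompt_tokens': 2048, 'cached_tokens': 1024, 'cached_pct': 50.0}
```

With the `openai` Python SDK (v1.x), you would obtain the payload from a real call with something like `cache_stats(response.usage.model_dump())`; a nonzero `cached_tokens` confirms the discount was applied.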

