locally sounds straightforward. Download the weights, start the server, and send requests. That works for a chatbot, but it doesn’t automatically work for an agent. In my case, I’ve been building an agent for automated single-cell RNA-seq analysis. The idea is that, given raw data, the agent can run the full pipeline on its own, deciding which tools to call, reading the results, and working through the analysis step by step.
You might ask why not just use something like Claude Code with a single-cell analysis Skill. The short answer is that for scientific workflows, that’s not quite enough. Skills are ultimately prompts and can thus be overridden or ignored. More importantly, scientific work requires reproducibility and provenance tracking: knowing exactly which parameters were used, which cells were filtered, which clustering resolution produced which result, etc. That record needs to be structured and persistent, not reconstructed from a conversation. For long-running sessions, you also need explicit world state management rather than relying on context compaction to preserve what matters. These are things you have to build deliberately. Building all of these on top of a local model also means you own the infrastructure, and that’s what I’m going to be focusing on here.
The agent we built runs on institutional HPC hardware using recent open-weight models. It is easy to assume open-weight models are not strong enough for this kind of work. But that is becoming less true. Recent releases like Qwen3.6–27B and Gemma 4–31B are genuinely useful for structured, tool-driven workloads (If you’re interested in keeping up with how open source is evolving, Interconnects AI has interesting stuff you can follow). And that’s one of the main reasons why local hosting makes sense here. Our agent also supports cloud APIs like Claude and GPT, but when you use those, all of the infrastructure I’m about to describe is invisible to you. Someone else has already solved it. When you host the model yourself, those problems become yours.
When I ran the model the first time, it worked in a narrow sense. The model would call tools, the tools would run, and the analysis would move forward. But it wasn’t really usable yet. A simple single-cell analysis could have 50–80 tool calls in a loop. Every call carried the same fixed baggage: the system prompt, the tool schemas, and the growing conversation history. For this agent, the system prompt and tool schemas alone were about 36k tokens. Before the model could decide anything, it first had to read tens of thousands of tokens of instructions and tool definitions. Then it had to do that again on the next iteration. And again on the one after that. Each iteration took 10 to 15 seconds. And a long session would eventually crash out with context overflow errors, taking all the in-memory analysis state with it. This article is about fixing both of those problems.
The first part covers making inference faster through a set of compounding optimizations to the vLLM inference server (an open-source inference engine built for high-throughput LLM serving). The second part covers keeping long sessions alive through better context management and a structured world state that survives trimming. I ran experiments on A100 and H100 GPUs to measure the impact of each change, and those are described below.
Part 1: Making Inference Fast
Before getting into the individual optimizations, it helps to understand what’s actually happening on each iteration of the agent loop. The diagram below shows a single iteration: the agent sends a request containing the system prompt, tool schemas, and the full conversation history to the model. The model reads all of it and decides which tools to call. The tool runs and returns a result, and that result gets appended to the history before the next iteration begins. Two things are worth noting here. The fixed prefix, which is the system prompt plus tool schemas, is roughly 36k tokens and gets sent on every single call. And the conversation history grows with every iteration. By iteration 40, the model is no longer reading a short instruction. It’s reading a long analysis transcript with many tool calls, tool outputs, intermediate results, etc. Both of these things affect the performance of the agent.
1.1 CUDA Graphs: Reducing Hundreds of Instructions Per Token to One
To understand this one, it helps to know what happens inside a GPU when it generates a single token.
Generating a single token during the decode phase involves executing a sequence of GPU kernels in order: attention, feed-forward, normalization, and so on. Each of these kernel launchers has a small coordination cost on the CPU side. The CPU has to queue an instruction telling the GPU exactly which kernel to run, with which tensor shapes and memory pointers. For a 27-billion-parameter model we’ve been working with, this means hundreds of individual dispatches per token. Each one is small, but they add up.
CUDA graphs eliminate this overhead. Before handling any real requests, vLLM can run a warmup pass where it records all the kernel dispatches for the decode step into one single replayable object. After that, generating each token is one instruction to the GPU instead of hundreds. The result is roughly 20–25% lower latency from this single change, with no change to the model itself.
This is great, but CUDA graphs also require static tensor shapes, meaning the graph is compiled for a specific batch size and sequence length. What this means for us is that the first startup takes longer than subsequent ones. Subsequent startups are much faster. As shown in Figure 8 below, for an agent running hundreds of iterations, the cumulative effect is a lot.
1.2 Fitting More in Memory
Every weight in a neural network is a number, and the format you use to store that number affects both how much memory it takes and how fast the GPU can work with it. The standard format for modern LLMs (at least for training) is BF16, which stores each weight as a 16-bit floating number. For Qwen3.6–27B’s 27 billion parameters, that’s roughly 56GB of weight data just to load the model.
FP8 stores each weight in one byte instead of two. The same model now fits in around 31GB. We can now use that freed memory for the KV cache, which is what stores the conversation context. More KV cache means the model can handle longer inputs before running out of room. But the memory we free for the KV cache is not used the same way by every model. How much actual context we get from that memory depends on the model architecture itself. A useful number here is KV memory per token, which is just basically how much GPU memory the model needs to store one token of context.
This is why two models with similar parameter counts can behave differently in practice. For example, Gemma 4–31B uses roughly 1.1MB of KV cache per token. Qwen3.6–27B, depending on how you count its attention layers, can be closer to 256KB per token in a conservative estimate. That means the same amount of leftover GPU buys you many more tokens of context on Qwen than on Gemma.
For example, suppose after loading the model on two 80GB GPUs and leaving some runtime overhead, we have around 82GB available for the KV cache. With Gemma, we get 82GB/1.1MB ≈ 74k tokens. With Qwen, if we use 256KB per token, 82GB/256KB ≈ 320k tokens. In practice, the model’s configured maximum context length caps this around 262k tokens (well, actually, if you use YaRN, it can be extended to 1M tokens), but the point is the same. Qwen can use the same GPU memory much more efficiently for long context workloads.
Going back to working with FP8 for our model weights, actually multiplying FP8 numbers together using hardware tensor cores requires dedicated FP8 arithmetic units, which NVIDIA introduced in the Hopper generation. The H100 has them. The A100 does not. So on A100 we use BF16 weights, and on H100 we use FP8 weights. The speed benefit follows directly from this. During decode, the GPU has to read the model weights from memory on every single token it generates. At batch size 1, which is what a single-user agent session looks like, that memory reading is the bottleneck, not the computation itself. Smaller weights mean less data to read per token, which means faster generation.
Apart from the model weights themselves, there is a second place where FP8 helps, and this one works on both GPUs. Storing KV cache vectors in FP8 instead of BF16 halves the per-token cost (for Qwen3.6–27B, it goes from 256KB to 128KB), directly doubling how many tokens fit in memory. The vectors are stored in FP8 for memory efficiency but dequantized back to BF16 when actually used in attention computation, so this does not require FP8 tensor cores.

There is one more thing that compounds the memory benefit. Running the model across multiple GPUs with tensor parallelism splits the weight matrices across both cards. Each GPU now holds half the weights, which frees up even more room per GPU for KV cache. On A100, this takes the per-GPU weight footprint from 56GB down to 28GB, leaving 44GB per GPU for KV cache instead of 16GB. That translates to a context window of around 180K tokens on A100 hardware, which is enough for a full analysis session to run without overflow (Tensor parallelism is different from FSDP, which is another method of distributing workload. You can read my other article here to learn more).

1.3 Prefix Caching
Remember that fixed prefix we mentioned earlier: the system prompt and tool schemas that get sent on every single agent loop iteration. For this agent, that’s roughly 36K tokens. On every iteration, before the model can decide anything, it first has to read and process all of those tokens from scratch. That means computing the full attention over 36K tokens on every call, even though nothing in that prefix has changed since the last call.
Prefix caching solves this by storing the key and value vectors for any token sequence the model has already processed. If the next request starts with the same prefix, those vectors are retrieved directly from cache rather than recomputed. The model only pays the full prefill cost on the very first request. Every subsequent request in the same session skips straight to the new tokens at the tail. But if something changes in that prefix mid-way through the session, the entire history needs to be read from scratch. For example, if you edit the system prompt or edit the tool list by adding MCP tools, for instance, the whole thing needs to be re-read. The same is true if you change the model mid-session. You might have seen this in Claude Code where it tells you that the entire message needs to be re-read if you try to change the model mid-session.
For an agent loop, this is particularly valuable because the fixed portion is large and the new portion added each iteration is comparatively small. As the session progresses, the cache hit rate actually improves. By iteration 40, most of the request is cached history and only the newest additions need fresh computation.

To measure the actual impact, we ran the agent’s real system prompt and tool schemas, the full 36K tokens, through the vLLM server with and without prefix caching enabled, across both A100 and H100 hardware. We measured time to first token on the cold start, which is the first request where the cache is empty and the full prefix has to be computed from scratch, and on subsequent warm requests where the cache is already populated. On A100, cold start Time to First Token (TTFT) was 11,470ms. With a warm cache, that dropped to 706ms. On H100, the cold start was 2,655ms, dropping to 249ms on warm start. That’s because the prefix is not being recomputed. Only the new tokens at the tail are processed.

1.4 Speculative Decoding
Decoding is inherently sequential. The model generates one token, appends it to the context, then generates the next in an autoregressive manner. Each token depends on all the ones before it, so you cannot parallelize across tokens the way you can parallelize across a batch. For a single-user agent session at batch size 1, this sequential bottleneck is the main constraint on throughput.
Speculative decoding gets around this by introducing a small draft model that runs ahead of the main model. The draft model proposes the next k tokens cheaply and quickly. The main model then verifies all k proposed tokens in a single parallel forward pass. Because the main model is reading k tokens simultaneously rather than generating them one by one, the verification step costs roughly the same as generating a single token normally. If most of the proposals are accepted, you get k tokens for nearly the price of one.

The key variable is the acceptance rate. If the draft model’s proposals are consistently wrong, the main model rejects them and falls back to generating one token at a time, but you still paid the overhead of running the draft path. The exact breakeven point depends on the draft model, the number of proposed tokens, the hardware, and the serving implementation. In our setup, acceptance below roughly 40% was not worth it.
This is where the choice of draft model matters a lot. We initially tried DFlash, a separate small model used as a draft. The acceptance rate on our workload was 4 to 7%, well below breakeven. It actually made things slower. (To be fair, as of the day I’m writing this article, the creators of DFlash over at Z lab said that the draft model from Qwen3.6–27B was still under training, so it might be better once that is done). But for our case, Qwen3.6–27B has something better built in: a Multi-Token Prediction head (MTP), which is an auxiliary prediction head trained alongside the main model and baked directly into the weights. Because the MTP head is trained alongside the main model and uses the model’s own hidden states, its proposals are much better aligned with what the model would have generated anyway.
We measured MTP acceptance rates across real agent sessions on both A100 and H100. At a median ~89% acceptance, MTP was safely on the useful side of the tradeoff. Well above breakeven, and stable across different parts of the analysis workflow.

Putting It All Together
Each of these optimizations was benchmarked cumulatively, with every configuration building on all the previous ones. The full stack was measured across both A100 and H100 hardware on the Qwen model we’ve been using with our real 36K token system prompt, matching the actual conditions of an agent session.

A few things stand out. CUDA graphs are the dominant decode gain, giving roughly 3x on A100 and 6x on H100. The H100 baseline actually starts slower than A100, which is counterintuitive (actually, one thing to note here is that the communication protocol between your GPUs matters a lot, apart from just the GPU type. The gold standard is using NVLink via NVSwitch followed by NVLink Bridge, and then PCIe). It reflects how severely CPU dispatch overhead limits FP8 kernels before graphs are compiled. Once graphs are enabled, though, H100 pulls ahead and stays there.
FP8 KV cache and prefix caching are flat on decode throughput, which is expected. They address memory capacity and prefill latency respectively, not token generation speed. The prefix caching effect shows up clearly on the TTFT side of the waterfall: a flat line across the first three configurations, then a sharp drop when caching is enabled.
MTP is the second largest decode contributor on both GPUs, adding around 37% on A100 and 20% on H100.
Part 2: Keeping Long Sessions Alive
When using cloud models, context management is easier to ignore. The context windows are often large enough for ordinary chat sessions, and the serving infrastructure is handled for you. When you run a local model, the context window becomes a hardware budget. More context means more KV cache. More KV cache means more GPU memory. On one of our earlier A100 configurations, the effective context window was around 74K tokens.
A single-cell analysis can run 50 to 80+ iterations. Each iteration appends tool calls, tool results, intermediate observations, plots, errors, corrections, and user constraints back into the conversation history. When it fills without any management, the API returns a context length exceeded error and the session dies, taking all the in-memory analysis state with it, including the AnnData object holding the processed dataset.
So the problem was not just making the model fast. The agent also needed to survive long enough to finish.
Context management sounds simple. Track how full the window is and trim it when it gets too full. In practice, though, there are a few places where naive implementations can go wrong.
Anthropic has a cookbook on context engineering for agents, and describes three strategies for long-horizon tasks: compaction, structured note-taking, and multi-agent architectures. Compaction is the most common solution, and for a general-purpose assistant, it works well. When the context fills up, the conversation history is passed back to the model to summarize, and the session continues with that compressed version.
For a general assistant, that can work. For scientific analysis, it loses the wrong things.
The problem for a scientific workflow is that a prose summary loses exactly the information you need. “The analysis clustered the data and ran quality control” is a valid summary, but it discards the QC thresholds, the clustering resolution, the number of cells retained, etc. Those exact parameters are what the agent needs to reproduce a step, describe its methodology, or correctly answer a question about what it did. Those are not cosmetic details. They are the analysis. A scientific agent needs the exact record, not just the gist.
Beyond the conceptual problem with compaction, there are more basic ways context management can fail. The first is fixed-cost accounting. Every API call includes the system prompt, tool schemas, and reserved completion budget before a single history message appears. For this agent, the system prompt and tool schemas alone are around 36K tokens. If the trim threshold does not subtract those fixed costs first, the agent can end up trimming against a budget that was already exceeded before any history was included.
Beyond the conceptual problem with compaction, there are more basic ways context management can fail. The first is fixed-cost accounting. Every API call includes the system prompt, tool schemas, and reserved completion budget before a single history message appears. For this agent, the system prompt and tool schemas alone are around 36K tokens. If the trim threshold does not subtract those fixed costs first, the agent can end up trimming against a budget that was already exceeded before any history was included.
The third is context-limit discovery. The context limit is the denominator of every budget calculation. If the model metadata query fails and the code silently falls back to a hardcoded default, every trim decision downstream is wrong.
The better approach is to stop treating the conversation history as the record of what happened. For a scientific workflow, you already have a more reliable record: the structured log of every step the agent took, with exact parameters and outcomes. We call this the world state.
The world state is a Python object that tracks the analysis as it progresses. Every tool call that completes writes a structured entry to it: which step ran, with what parameters, and what the outcomes were. This gets serialized into the system prompt on every iteration. It takes under 1,000 tokens, it contains the exact parameters rather than a prose summary of them, and it lives in the system prompt, which is never trimmed. When old tool results are removed from the message history, the analysis record survives intact.
This changes how you think about trimming. Instead of treating message history as something precious that has to be preserved because it contains the record of what happened, you can trim it aggressively because the record is elsewhere. The history becomes a useful context. The world state becomes the ground truth.

The world state handles the question of what to preserve. The remaining fixes handle the question of how to know when to trim and by how much.

The first fix was to stop treating the entire context window as available history. The available budget is computed by subtracting fixed costs upfront:
available = (
context_limit
- tool_schema_tokens
- system_tokens
- COMPLETION_RESERVE
- safety_margin
)
With a 262K context window and typical overhead, this leaves around 219K tokens for message history. With a tighter 32K context, it correctly reports only a few thousand tokens available. That is useful. It tells you immediately that long sessions won’t work at that context size, instead of silently crashing 3 iterations later.
The second fix was self-calibrating token counts. Rather than trying to match Qwen’s exact tokenization rules, we use the API’s own response to correct our estimates. After every call, the response includes the actual number of tokens processed. We compare that to our estimate and adjust the correct factor upward if the actual count was higher:
if actual_tokens > our_estimate:
calibration = max(calibration, actual_tokens / our_estimate)
calibration = min(calibration, 4.0)
The factor only goes up. An overestimate causes slightly more frequent trimming. An underestimate causes the next call to fail. The consequences are asymmetric, so the correction is one-directional.
The third fix was to trim strategically rather than uniformly. When approaching the budget, the agent collects eligible tool results, sorts them by size, and removes the largest ones first. A single large code execution output can contain 50 to 200KB of logs, tables, or base64-encoded plot data. Removing one large block can save as much context as removing dozens of small messages. User messages are never trimmed. They contain the scientific intent and constraints that define what the analysis is supposed to do.
Together, these changes made the agent actually usable. A full 50+ iteration analysis now runs to completion without the user seeing a context error.
Conclusion
Running an LLM locally for a real agentic workload exposes problems that are easy to miss when using a cloud API. The model does not just need to be good. The inference server needs to be configured deliberately, and the context window needs to be managed deliberately. Neither problem is impossible, but neither is automatic.
The optimizations in Part 1 compound in ways that are not obvious upfront. CUDA graphs take a barely functional baseline and make it usable. Prefix caching changes the interactive feel of the agent by avoiding the cost of rereading the same 36K-token prefix on every call. FP8 KV cache increases the amount of context that fits in memory. MTP adds meaningful decode throughput on top. Together, these changes take the agent from 10 to 15 seconds per iteration to roughly 1 to 3 seconds.
The context management changes in Part 2 solve a different problem: correctness. A long-running scientific agent needs to remember what it did, not just continue the conversation. The conversation history is a useful context, but it is a fragile source of truth. It grows, gets trimmed, and eventually has to be compressed or discarded. The world state approach, in particular, is something I would recommend to anyone building a domain-specific agent for scientific or analytical workflows. The durable record should live outside the transcript, in a structured state that records each step with exact parameters and outcomes.
That was the main lesson from building this system. A useful agent is not just an LLM with tools. It is a loop with infrastructure around it. The model decides what to do next, but the system around it determines whether the loop is fast enough, stable enough, and reliable enough to finish the work.
Thanks for reading, and I hope you found this helpful!