better models, larger context windows, and more capable agents. But most real-world failures don’t come from model capability — they come from how context is constructed, passed, and maintained.
This is a hard problem. The space is moving fast and techniques are still evolving. Much of it remains an experimental science and depends on the context (pun intended), constraints and environment you’re operating in.
In my work building multi-agent systems, a recurring pattern has emerged: performance is far less about how much context you give a model, and far more about how precisely you shape it.
This piece is an attempt to distill my learnings into something you can use.
It focuses on principles for managing context as a constrained resource — deciding what to include, what to exclude, and how to structure information so that agents remain coherent, efficient, and reliable over time.
Because at the end of the day, the strongest agents are not the ones that see the most. They are the ones that see the right things, in the right form, at the right time.
Terminology
Context engineering
Context engineering is the art of providing the right information, tools and format to an LLM for it to complete a task. Good context engineering means finding the smallest possible set of high signal tokens that give the LLM the highest probability of producing a good outcome.
In practice, good context engineering usually comes down to four moves. You offload information to external systems (context offloading) so the model does not need to carry everything in-band. You retrieve information dynamically instead of front-loading all of it (context retrieval). You isolate context so one subtask does not contaminate another (context isolation). And you reduce history when needed, but only in ways that preserve what the agent will still need later (context reduction).
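The four moves can be sketched with plain Python structures. This is a minimal illustration, not any particular framework's API; the message format, helper names, and store are all assumptions made for the example.

```python
# Minimal sketch of the four moves. The message format, helper names,
# and external store are illustrative, not from any specific framework.

def offload(context, item, store):
    """Context offloading: persist a large item externally, keep a pointer."""
    key = f"doc-{len(store)}"
    store[key] = item
    context.append({"role": "system", "content": f"[stored externally: {key}]"})
    return key

def retrieve(context, key, store):
    """Context retrieval: pull an offloaded item back in only when needed."""
    context.append({"role": "system", "content": store[key]})

def isolate(task_input):
    """Context isolation: a subtask starts from a fresh, minimal context."""
    return [{"role": "user", "content": task_input}]

def reduce(context, keep_last=4):
    """Context reduction: drop old turns but always keep the objective."""
    objective, rest = context[0], context[1:]
    return [objective] + rest[-keep_last:]

store = {}
ctx = [{"role": "system", "content": "Objective: summarise the report"}]
key = offload(ctx, "full report text goes here", store)
retrieve(ctx, key, store)
ctx = reduce(ctx, keep_last=1)
print(len(ctx))  # -> 2: the objective plus the most recent message
```

The point of the sketch is the shape, not the mechanics: the context stays small because bulk content lives elsewhere, subtasks start clean, and reduction never discards the objective.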
A common failure mode on the other side is context pollution: the presence of so much unnecessary, conflicting, or redundant information that it distracts the LLM.
Context rot
Context rot is a situation where an LLM’s performance degrades as the context window fills up, even if it is within the established limit. The LLM still has room to read more, but its reasoning starts to blur.
You may have noticed that the effective context window, the range within which the model performs at high quality, is often much smaller than what the model is technically capable of.
There are two parts to this. First, a model does not maintain perfect recall across its entire context window. Information at the start and the end is recalled more reliably than information in the middle.
Second, larger context windows do not solve problems for enterprise systems. Enterprise data is effectively unbounded and so frequently updated that even if the model could ingest everything, it could not maintain a coherent understanding of it.
Just like humans have a limited working-memory capacity, every new token introduced to the LLM depletes some of its attention budget. This scarcity stems from an architectural constraint of the transformer: every token attends to every other token, an n² interaction pattern for n tokens. As the context grows, the model is forced to spread its attention thinner across more relationships.
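The quadratic growth is easy to make concrete. The token counts below are arbitrary examples:

```python
# Every token attends to every other token, so pairwise interactions
# grow quadratically with context length. Token counts are arbitrary.
counts = {n: n * n for n in (1_000, 10_000, 100_000)}
for n, pairs in counts.items():
    print(f"{n:>7} tokens -> {pairs:.1e} pairwise interactions")
# A context 10x longer produces 100x more interactions for the model
# to spread its attention across.
```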
Context compaction
Context compaction is the general answer to context rot.
When the model nears the limit of its context window, it summarises the window's contents and reinitialises a new context window seeded with that summary. This is especially useful for long-running tasks, allowing the model to keep working without too much performance degradation.
Recent work on context folding offers a different approach — agents actively manage their working context. An agent can branch off to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome.
The difficulty, however, is not in summarising, but in deciding what survives. Some things should remain stable and nearly immutable, such as the objective of the task and hard constraints. Others can be safely discarded. The challenge is that the importance of information is often only revealed later.
Good compaction therefore needs to preserve facts that continue to constrain future actions: which approaches already failed, which files were created, which assumptions were invalidated, which handles can be revisited, and which uncertainties remain unresolved. Otherwise you get a neat, concise summary that reads well to a human and is useless to an agent.
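One way to make this concrete is to compact into a structured record rather than free-form prose. The field names and the 80% threshold below are illustrative assumptions; in a real harness, the classification of events would be done by the model rather than by the hard-coded rules shown here.

```python
# Compaction into a structured record that preserves facts which
# constrain future actions. Field names and the 80% threshold are
# illustrative; event classification stands in for a model call.
from dataclasses import dataclass, field

@dataclass
class CompactionRecord:
    objective: str                                       # stable, near-immutable
    failed_approaches: list = field(default_factory=list)
    artefacts: list = field(default_factory=list)        # e.g. files created
    invalidated_assumptions: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

def compact(history, objective, max_tokens, used_tokens):
    """Near the window limit, fold history into a structured record."""
    if used_tokens < 0.8 * max_tokens:
        return history  # no compaction needed yet
    record = CompactionRecord(objective=objective)
    for event in history:
        if event.get("kind") == "failure":
            record.failed_approaches.append(event["what"])
        elif event.get("kind") == "artefact":
            record.artefacts.append(event["what"])
    # Re-initialise the window with the record instead of the raw trace.
    return [{"role": "system", "content": repr(record)}]

history = [
    {"kind": "failure", "what": "regex approach hit catastrophic backtracking"},
    {"kind": "artefact", "what": "created parser.py"},
    {"kind": "chatter", "what": "intermediate reasoning"},
]
new_ctx = compact(history, "Parse the log format",
                  max_tokens=100_000, used_tokens=90_000)
```

Note what survives and what does not: the failed approach and the created file are kept because they constrain future steps, while the intermediate chatter is dropped.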
Agent harness
A model is not an agent. The harness is what turns a model into one.
By harness, I mean everything around the model that decides how context is assembled and maintained: prompt serialization, tool routing, retry policies, the rules governing what is preserved between steps, and so on.
Once you look at real agent systems this way, a lot of supposed “model failures” look different. I’ve encountered many of them at work. They are actually harness failures: the agent forgot because nothing persisted the right state; it repeated work because the harness surfaced no durable artefact of prior failure; it chose the wrong tool because the harness overloaded the action space; and so on.
A good harness is, in some sense, a deterministic shell wrapped around a stochastic core. It makes the context legible, stable, and recoverable enough that the model can spend its limited reasoning budget on the task rather than on reconstructing its own state from a messy trace.
Communication between agents
As tasks get more complex, teams have defaulted to multi-agent systems.
The mistake is to assume that more agents means more shared context. In practice, dumping a giant shared transcript into every sub-agent often creates exactly the opposite of specialisation. Now every agent is reading everything, inheriting everyone else’s mistakes, and paying the same context bill over and over again.
If only some context is shared, a new problem appears. What is considered authoritative when agents disagree? What remains local, and how are conflicts reconciled?
The way out is to treat communication not as shared memory, but as state transfer through well-defined interfaces.
For discrete tasks with clear inputs and outputs, agents should usually communicate through artefacts rather than raw traces. A web-search agent, for instance, does not need to pass along its entire browsing history. It only needs to surface the material that downstream agents can actually use.
This means that intermediate reasoning, failed attempts, and exploration traces stay private unless explicitly needed. What gets passed forward are distilled outputs: extracted facts, validated findings, or decisions that constrain the next step.
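A distilled artefact might look something like this. The agent, dataclass fields, and hard-coded results are all illustrative assumptions standing in for real tool calls:

```python
# Sketch: a sub-agent returns a distilled artefact, not its raw trace.
# The fields and hard-coded results are illustrative.
from dataclasses import dataclass

@dataclass
class SearchArtefact:
    query: str
    findings: list[str]     # validated facts the next agent can use
    sources: list[str]      # handles that can be revisited if needed

def web_search_agent(query):
    # Internally this agent may browse dozens of pages; that trace
    # stays private and is never handed downstream.
    trace = ["visited page A", "dead end on page B", "visited page C"]
    return SearchArtefact(
        query=query,
        findings=["fact extracted from page A", "fact extracted from page C"],
        sources=["page A", "page C"],
    )

artefact = web_search_agent("context rot")
# The downstream agent sees only `artefact`, never the browsing trace.
```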
For more tightly coupled tasks, like a debugging agent where downstream reasoning genuinely depends on prior attempts, a limited form of trace sharing can be introduced. But this should be deliberate and scoped, not the default.
KV cache penalty
When LLMs generate text, they repeat many of the same calculations. KV caching is an inference-time optimisation that speeds this up by reusing the attention keys and values computed in previous steps instead of recomputing them from scratch.
However, in multi-agent systems, if every agent shares the same full context, you confuse the model with a mass of irrelevant details and pay a massive KV-cache penalty whenever those contexts diverge. Multiple agents working on the same task need to communicate with each other, but not by sharing memory.
This is why agents should communicate through minimal, structured outputs in a controlled manner.
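The cache penalty comes from prefix stability: serving stacks can only reuse cached keys and values up to the first token that differs between prompts. A rough sketch, with an illustrative prompt layout:

```python
# KV-cache reuse depends on a stable prompt prefix: cached keys/values
# can only be reused up to the first differing token. The prompt
# layout here is illustrative.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two prompts (in characters,
    as a stand-in for tokens)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

stable = "SYSTEM PROMPT\nTOOL DEFINITIONS\n"   # identical across turns
turn1 = stable + "user: first question"
turn2 = stable + "user: second question"
# Everything in `stable` can be served from cache on the second turn.
cached = shared_prefix_len(turn1, turn2)
```

This is why a harness that keeps the system prompt and tool definitions byte-stable across turns, and appends only at the end, pays far less for inference than one that injects per-agent details near the top of the prompt.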
Keep the agent’s toolset small and relevant
Tool choice is a context problem disguised as a capability problem.
As an agent accumulates more tools, the action space gets harder to navigate. There is now a higher probability of the model choosing the wrong action and taking an inefficient route.
This has consequences. Tool schemas need to be far more distinct than most people realise. Tools have to be well understood and have minimal overlap in functionality. Each tool’s intended use should be obvious, and its input parameters should be unambiguous.
One common failure mode I have noticed, even in my own team, is a bloated set of tools accumulated over time, which leads to unclear decisions about which tool to use.
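Here is what distinct schemas look like in practice. The shape follows the JSON-schema style many LLM APIs use for tool definitions; the tool names and parameters are invented for the example:

```python
# Two tool schemas with distinct purposes and unambiguous parameters.
# The schema shape follows the common JSON-schema style used by many
# LLM APIs; the tool names and fields are illustrative.
tools = [
    {
        "name": "search_orders",
        "description": "Look up existing orders by customer ID. "
                       "Read-only; never creates or modifies orders.",
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
    {
        "name": "create_refund",
        "description": "Issue a refund for a single order. "
                       "Use only after the order has been located.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["order_id", "amount_cents"],
        },
    },
]
```

Each description states what the tool does, what it does not do, and when to reach for it; the two never overlap, so the model has no ambiguous choice to make.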
Agentic memory
This is a technique where the agent regularly writes notes that are persisted to memory outside the context window. These notes get pulled back into the context window at later times.
The hardest part is deciding what deserves promotion into memory. My rule of thumb is that durable memory should hold only what continues to constrain future reasoning, such as persistent preferences. Everything else should clear a very high bar. Storing too much is just another route back to context pollution, only now you have made it persistent.
But memory without revision is a trap. Once agents persist notes across steps or sessions, they also need mechanisms for conflict resolution, deletion, and demotion. Otherwise long-term memory becomes a landfill of outdated beliefs.
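A memory store with those mechanisms might be sketched as follows. The structure, last-write-wins conflict rule, and staleness heuristic are all illustrative assumptions:

```python
# Sketch of agentic memory with promotion, conflict resolution, and
# demotion. Structure and heuristics are illustrative.
import time

class AgentMemory:
    def __init__(self):
        self.notes = {}  # key -> {"value", "updated_at", "uses"}

    def promote(self, key, value):
        """Write or overwrite a note; the newest belief wins conflicts."""
        self.notes[key] = {"value": value, "updated_at": time.time(), "uses": 0}

    def recall(self, key):
        """Pull a note back into context, tracking how often it is used."""
        note = self.notes.get(key)
        if note is None:
            return None
        note["uses"] += 1
        return note["value"]

    def demote_stale(self, max_age_s, min_uses=1):
        """Delete notes that are old and were never recalled."""
        now = time.time()
        for key in list(self.notes):
            note = self.notes[key]
            if now - note["updated_at"] > max_age_s and note["uses"] < min_uses:
                del self.notes[key]

mem = AgentMemory()
mem.promote("user.language", "English")
mem.promote("user.language", "German")   # conflict: latest write wins
```

Without `demote_stale` and the overwrite-on-promote rule, this store would only ever grow, which is exactly the landfill of outdated beliefs described above.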
To sum up
Context engineering is still evolving, and there is no single correct way to do it. Much of it remains empirical, shaped by the systems we build and the constraints we operate under.
Left unchecked, context grows, drifts, and eventually collapses under its own weight.
If well-managed, context becomes the difference between an agent that merely responds and one that can reason, adapt, and stay coherent across long and complex tasks.