Prefill Once, Fan Out: KV Snapshot Sharing for Multi-Agent LLM Pipelines



A humorous-but-real tour of SwarmKV — KV-snapshot fan-out, copy-on-fork host buffers, and how to make a two-agent analytical pipeline ~1.95× faster (and the second branch’s activation latency 52× faster) by being mildly mean to llama.cpp.

This is Part 1 of the “Production-Grade Agentic Inference” series. Each part removes one kind of redundant work from an agentic LLM pipeline. Part 1 (this post) kills redundant prefill. Part 2 tackles redundant waiting: how 50 micro-agents share one GPU through time-slicing. Part 3 keeps RAG retrieval on the GPU with a custom CUDA Top-K kernel. Part 4 persists agent state across hand-offs so the next agent never has the cold-start problem.

Key takeaways


The problem: when several agents read the same document, a default serving stack makes each one rerun the exact same prefill. That redundant dense attention pass is pure waste.

The fix: run prefill once, serialize the KV cache to a host buffer, memcpy it per branch, and restore it before decoding. “Compute once, fan out.”

The receipts: on a seven-year-old GTX 1080, a two-agent pipeline finished in 48.69% less wall-clock time end to end (~1.95×) and the second agent’s activation latency dropped 98.09% (~52×), eliminating 8,685 ms of redundant compute.

The kicker: this is not a new algorithm. It is systems engineering — and it is the same “broadcast shared state once” decision a 5G cell tower has made every 80 ms since LTE.

TL;DR: Standard LLM serving makes every analytical agent re-prefill the same shared document. Your GPU dutifully re-executes billions of redundant prefill multiplications. The same bytes. The same weights. The same quantization. All to recalculate a state it finished calculating four seconds ago. SwarmKV runs the prefill once, serializes the resulting KV state to a host buffer via llama_state_get_data, memcpys that buffer into a per-branch allocation, and lets each branch restore the snapshot with llama_state_seq_set_data before decoding from where the document left off. Yes, it is a real round trip (serialize, copy, restore), but the attention part of prefill scales quadratically with document length while the KV state transfer scales linearly, so moving the data across a constrained Pascal memory bus is still vastly cheaper than recalculating the attention matrices from scratch. The result on a seven-year-old GTX 1080: 48.69% less end-to-end wall clock on a two-agent pipeline (~1.95×), a 98.09% reduction in branch-2 activation latency (~52×), 8,685 ms of redundant dense compute eliminated, zero new transformer tricks. Just the systems-engineering position that “compute once, fan out” beats “compute N times, hope nobody notices.”

GitHub repo: https://github.com/AnubhabBanerjee/swarmkv

(Quick confession before we start: I came at this from a 5G/6G RAN engineering background. As it turns out, fanning out a shared computation to many downstream consumers is shockingly close to what a cell tower has been doing every 80 ms since LTE when it broadcasts SIB1. There’s a whole section on that below — section 8 — but it’s also why I’m writing this in the first place.)

Architecture mental model — keep this open while you read.

Document → PrefillNode → llama_state_get_data → host KV buffer → memcpy per branch → llama_state_seq_set_data → AnalyticalNode decode (RoPE continues at prefix_seq_len)

Everything below is just commentary on one part of that line.

SwarmKV architectural overview

1. A confession: most of your second agent’s “work” is a rerun

If you have ever pointed two analytical agents at the same document through vanilla llama.cpp, here is what really happens (with a little bit of intentional dramatization):

You: “Please provide an overview of this 3,500-token spec, and separately, list its license obligations.”

llama.cpp (Agent 1): “Sure. Loading model. Prefilling document. Decoding answer.”

GPU spends 4,346 ms on dense attention

llama.cpp (Agent 1): “Done. Here’s a 6-token answer.”

You: “Great. Now Agent 2.”

llama.cpp (Agent 2): “Sure. Loading model. Prefilling document...”

You: “Wait, you literally just did that.”

llama.cpp (Agent 2): “I’m an independent llama_context. I have no memory of Agent 1. I have no memory of anything. I am a beautiful, stateless newborn.” 🫡

GPU spends another 4,339 ms on bit-for-bit identical attention math.

Your GPU thermal sensor: gets a good workout;
Your AWS bill: develops a sense of humor;
Your second agent’s TTFT: 4.3 seconds before it can answer a 4-token question.

That is the joke. That is the dirty secret of every “agentic” pipeline that fans out from a shared document. Each branch starts from a blank slate and rebuilds the same KV cache the previous branch just finished building. The longer the document, the worse the tax. At 3,500 tokens on a Pascal GPU, the vast majority of the second agent’s perceived latency is not the answer; it is reading the document again.

SwarmKV is what happens when you decide the second reading is optional and you would rather write 1,500 lines of C++ than let each agent build the same KV cache over and over again.

The toy demo in this repo runs two agents over one document: a summary and a license check. The real shape of the workload it is built for is N specialized evaluators over one dense technical document. Picture an AI patent and prior-art pipeline: one 50,000-token technical specification at the root, and fifty concurrent branches evaluating novelty, mapping claims, retrieving prior art, checking freedom-to-operate, assessing ethical compliance, and translating into jurisdiction-specific language. The baseline cost of that pipeline on a default serving stack is fifty full prefills of the same spec. The SwarmKV cost is one prefill plus fifty memcpys. That asymmetry is by design, and it is the entire reason the repo exists. I have written separately about detecting AI in invention reporting; this is the infrastructure half of that work. Inventor’s-notebook problems are exactly what SwarmKV is built around.
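To put rough numbers on that asymmetry, here is a back-of-envelope cost model. It is a sketch, not part of the repo: the struct and function names are mine, it assumes branches run back to back, and it assumes neither the prefill cost nor the per-branch cost changes as the branch count grows.

```cpp
#include <cstdint>

// Back-of-envelope cost model in milliseconds. All names are illustrative.
struct FanoutCost {
    std::int64_t baseline_ms;  // every branch re-prefills the document
    std::int64_t fanout_ms;    // one prefill, then copy + restore + decode per branch
};

// prefill_ms: cost of one document prefill.
// branch_ms:  cost of one copy + restore + short-prompt decode.
// The baseline's own short decode is ignored, which slightly favors the baseline.
inline FanoutCost estimate_fanout_cost(std::int64_t n_branches,
                                       std::int64_t prefill_ms,
                                       std::int64_t branch_ms) {
    FanoutCost c{};
    c.baseline_ms = n_branches * prefill_ms;
    c.fanout_ms   = prefill_ms + n_branches * branch_ms;
    return c;
}
```

Plugging in the 3,501-token measurements from later in this post (4,339 ms per prefill, 83 ms per branch) for fifty branches gives 216,950 ms against 8,489 ms. That is an extrapolation, not a measurement: a 50,000-token document would change both inputs, and the model ignores serialization overhead.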


2. Why does prefill exist at all? (a one-minute crash course)

Skip this if you already know. For everyone else, here is the short version.

An autoregressive LLM serves a request in two phases. Prefill is the dense pass that pushes every prompt token through every transformer layer once and populates the per-layer key/value (KV) cache. Decode then runs token by token, attending to the prefilled KV cache and growing it incrementally.

Prefill cost grows at least linearly with prompt length, and the attention part grows quadratically. Decode, by comparison, is cheap per token. On a Pascal-class GTX 1080 running Qwen2.5-7B Q4_K_M, prefilling a ~3,500-token document takes about 4.3 seconds; decoding a short branch prompt on top of an existing KV cache takes tens of milliseconds (about 77 ms per branch in the runs below), since it is dominated by setup, not arithmetic. That gap between prefill and decode is exactly the leverage SwarmKV uses.

Mainstream serving stacks (vLLM, TGI, SGLang, llama.cpp’s own server) treat every request as an independent context. Some of them have prefix caching, but it is usually request-scoped or session-scoped — not graph-scoped. They are built to maximize throughput across many independent user prompts, not to share state inside one analytical pipeline that fans out from a single shared document. For that DAG-shaped workload — one root, many leaves, same data — every public stack I tried made me pay for the root once per leaf.

SwarmKV is the explicit orchestration layer for that workload, written in C++ to bypass runtime abstractions, keep pointer lifetimes deterministic, and get plain memcpy efficiency for the fan-out.


3. The “just snapshot the KV” lightbulb (and why it’s harder than it sounds)

The pitch is simple:

  1. Run prefill once on the shared document under sequence id kSwarmkvPrefixSeqId.
  2. Serialize the resulting KV state into a host buffer via llama_state_get_data.
  3. For each downstream branch, memcpy that buffer into a per-branch allocation.
  4. Spin up a fresh llama_context, call llama_state_seq_set_data to install the snapshot, then decode the branch prompt with RoPE positions continuing from prefix_seq_len.

This is the “compute once, fan out” paradigm. The concept is beautifully simple and sounds like an easy weekend project, but three tedious edge cases immediately break the naive approach, which is why it takes more than a 30-line llama.cpp patch.

Problem A: How big is the KV?

An easy answer: n_layers × n_head_kv × n_ctx × head_dim × dtype × 2. Well, that hand-derived number drifts every time the quantization format changes, every time the GQA ratio changes, or every time the engine adds a new state field. The only honest number is the one the engine tells you under the current build.
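For the curious, here is what that hand-derived formula gives when you plug numbers in. This is a sketch under assumptions: the function name is mine, and the shape I use for Qwen2.5-7B (28 layers, 4 KV heads, head_dim 128, an f16 cache) is my reading of the model config, not something the repo states.

```cpp
#include <cstddef>

// Hand-derived KV footprint: K and V tensors for every layer, KV head, and slot.
// This is the estimate the text warns about, not what llama_state_get_size reports.
constexpr std::size_t estimate_kv_bytes(std::size_t n_layers,
                                        std::size_t n_head_kv,
                                        std::size_t n_ctx,
                                        std::size_t head_dim,
                                        std::size_t bytes_per_elem) {
    return n_layers * n_head_kv * n_ctx * head_dim * bytes_per_elem * 2;
}

// Assumed shape: 28 layers, 4 KV heads, n_ctx 4096, head_dim 128, 2-byte elements.
constexpr std::size_t kExampleKvBytes = estimate_kv_bytes(28, 4, 4096, 128, 2);
static_assert(kExampleKvBytes == 234881024, "224 MiB for the assumed shape");
```

That is 224 MiB on paper. Change the cache dtype or the GQA ratio and the number moves, and it never accounts for whatever extra fields the serialized state blob carries, which is the whole argument for asking the engine.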

So MemoryPool spins up a disposable llama_context solely to ask:

size_t MemoryPool::get_required_kv_size(uint32_t n_ctx) {
    // Start from library defaults so fields we do not care about remain sane.
    llama_context_params params = llama_context_default_params();
    params.n_ctx = n_ctx;

    // Construct a disposable context solely to query serialized state footprint.
    llama_context * ctx = llama_init_from_model(model_ref, params);
    if (!ctx) {
        throw std::runtime_error("MemoryPool::get_required_kv_size: llama_init_from_model failed.");
    }

    // Ask llama.cpp how many bytes a full state blob would occupy for this ctx.
    const size_t sz = llama_state_get_size(ctx);
    llama_free(ctx);

    // If the engine reports zero, fall back to a small non-zero allocation so tests
    // still exercise the registry without pretending we know exact tensor layouts.
    if (sz == 0) {
        return size_t{1} << 20;
    }
    return sz;
}

The logic here is simple: ask the engine politely instead of trusting a PDF or a hand-derived formula. Spin up a context, ask for the size, and allocate exactly that much. A simple yet very successful recipe.

Problem B: llama.cpp is a picky eater about concurrent decode

Under the pinned upstream llama.cpp revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe. The exact behaviour depends on the backend, the version, and the graph scheduler — in newer revisions or with isolated streams it may behave better — but in our setup the failure modes were one of: (a) crash, (b) corrupted KV, or (c) a ten-minute hang while you Google whether ggml has a thread-local arena yet. Spoiler: in the pinned upstream, not really.

The robust answer is to serialize the llama API surface at the boundary:

namespace swarmkv {

// llama.cpp CUDA paths are not safe for concurrent decode from multiple threads
// on one GPU without external serialization. All node execute() bodies must
// hold this mutex around llama_init / llama_decode / llama_free / state I/O.
inline std::mutex & llama_api_mutex() {
    static std::mutex m;
    return m;
}

} // namespace swarmkv

struct LlamaGuard {
    std::lock_guard<std::mutex> lock;
    LlamaGuard() : lock(swarmkv::llama_api_mutex()) {}
};

A simple 20-line header defines the entire concurrency policy. Every node’s execute() body holds this around llama_init_from_model / llama_decode / llama_state_seq_set_data / llama_free. The DAG-level concurrency is real (futures, dependencies, fanout); the GPU compute interleaves under a global lock. Pedants will correctly note this leaves perf on the floor compared to a hypothetical concurrent-decode upstream. Hold that thought — it is the exact bottleneck Part 2 of this series goes after.
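If you want to convince yourself that the guard really serializes the critical section, a stand-alone sketch like the one below does the job. It does not touch llama.cpp at all: demo_api_mutex, DemoGuard, and max_threads_in_guarded_section are illustrative names of mine, and the counter stands in for the llama_* call region.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Process-wide mutex, mirroring the llama_api_mutex() pattern from the text.
inline std::mutex & demo_api_mutex() {
    static std::mutex m;
    return m;
}

// RAII guard: the lock is held for exactly the lifetime of the object.
struct DemoGuard {
    std::lock_guard<std::mutex> lock;
    DemoGuard() : lock(demo_api_mutex()) {}
};

// Runs n_workers threads that each enter the guarded section `iters` times.
// Returns the largest number of threads ever observed inside the section.
inline int max_threads_in_guarded_section(int n_workers, int iters) {
    int inside = 0;      // only touched while the guard is held
    int max_inside = 0;  // only touched while the guard is held
    std::vector<std::thread> workers;
    for (int w = 0; w < n_workers; ++w) {
        workers.emplace_back([&inside, &max_inside, iters] {
            for (int i = 0; i < iters; ++i) {
                DemoGuard guard;  // stand-in for the llama_* call region
                ++inside;
                if (inside > max_inside) {
                    max_inside = inside;
                }
                --inside;
            }
        });
    }
    for (auto & t : workers) {
        t.join();
    }
    return max_inside;
}
```

With the guard in place the function reports that at most one thread was ever inside the section, no matter how many workers you launch. That is the trade: correctness now, parallel decode later.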

Problem C: There is no stable external KV bind API

The aesthetically perfect implementation would be to allocate one contiguous KV buffer, attach it to the new context directly and then skip the memcpy entirely. Upstream llama.cpp exposes llama_memory_t and graph decode paths, but the public header pinned in this repo does not ship a stable, exported llama_kv_cache_bind-style symbol.

So SwarmKV does the next-best thing: it keeps the call site, names it honestly, and builds this path on top of the state serialization calls (llama_state_get_data and llama_state_seq_set_data) instead.

void KVHandoff::bind_contiguous_cache(llama_context * ctx, ggml_backend_buffer_t cache) {
    // Validate arguments so misuse fails fast during bring-up and CI smoke runs.
    // A null context cannot decode; a null cache handle is a configuration bug.
    if (!ctx || !cache) {
        throw std::invalid_argument("KVHandoff::bind_contiguous_cache: null context or buffer.");
    }

    // Explicitly mark both parameters as intentionally unused in this revision.
    // This prevents -Wunused-parameter warnings under strict warning flags.
    (void) ctx;
    (void) cache;

    // No stable bind call is issued here; see file-level comment above.
    // When upstream adds a supported attachment API, implement it only in this function.
}

I know, I know. It is a function that does nothing. It has full argument validation, a docstring twice the length of the body, and a stable position in the call graph. It is patiently waiting for the day upstream lets it actually do its job. I have written more honest code in my life, I just cannot remember when!

This is also the part where careful readers go “wait, if bind_contiguous_cache is a no-op, what is the MemoryPool buffer even for?” Excellent question. It is the staging area — the canonical buffer where PrefillNode writes its llama_state_get_data blob, and the source that each branch memcpys from. Decode itself uses the context’s internally-managed KV. Pool buffer = host-side fan-out scratch; context KV = the engine’s own thing. Two memory regions, one snapshot, zero magic.


4. The five-step pipeline (the actually-cool part)

Step 0:  Validate doc + max_branch + 128 ≤ n_ctx        (context_budget.h, fail-fast)
Step 1:  Build the DAG; DFS-check for cycles            (Orchestrator)
Step 2:  Spawn std::async workers; gate on futures      (Orchestrator)
Step 3:  Prefill once, serialize KV to host buffer      (PrefillNode + MemoryPool)
Step 4:  memcpy snapshot → branch buffer → decode       (AnalyticalNode + KVHandoff)

Let’s walk through each one with the real code. The snippets have been kept short deliberately, while the full files are tiny and worth reading.

Step 0 — Fail-fast context budget

Three lines that save you from a 3 AM Slack message from your past self:

const int32_t required = prefix_tokens + max_branch + generation_headroom;
if (required > limit) {
    throw std::runtime_error(
        "Context budget exceeded: prefix_tokens=" + std::to_string(prefix_tokens) +
        " max_branch_tokens=" + std::to_string(max_branch) +
        " headroom=" + std::to_string(generation_headroom) +
        " required=" + std::to_string(required) + " n_ctx=" + std::to_string(limit));
}

This runs before any context is constructed, any pool buffer is allocated or any GPU memory is touched. If you ask SwarmKV to prefill 4,000 tokens into an n_ctx=4096 context with two branches and 128 tokens of decode headroom, it tells you the math does not work and goes to sleep. The kindest thing you can do for your future self is to reject impossible configurations before even allocating the first byte.
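The arithmetic is easy to try on its own. Below is a minimal stand-alone sketch of the same check; demo_validate_context_budget and budget_fits are hypothetical names, not the repo's API, and the 32-token branch prompt in the usage note is a number I made up for illustration.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Throws when prefix + longest branch prompt + decode headroom exceeds n_ctx.
inline void demo_validate_context_budget(std::int32_t prefix_tokens,
                                         std::int32_t max_branch_tokens,
                                         std::int32_t generation_headroom,
                                         std::int32_t n_ctx) {
    // Sum in 64 bits so absurd inputs cannot overflow before the comparison.
    const std::int64_t required = static_cast<std::int64_t>(prefix_tokens) +
                                  max_branch_tokens + generation_headroom;
    if (required > n_ctx) {
        throw std::runtime_error(
            "Context budget exceeded: required=" + std::to_string(required) +
            " n_ctx=" + std::to_string(n_ctx));
    }
}

// Convenience wrapper that reports the outcome as a bool instead of throwing.
inline bool budget_fits(std::int32_t prefix_tokens,
                        std::int32_t max_branch_tokens,
                        std::int32_t generation_headroom,
                        std::int32_t n_ctx) {
    try {
        demo_validate_context_budget(prefix_tokens, max_branch_tokens,
                                     generation_headroom, n_ctx);
        return true;
    } catch (const std::runtime_error &) {
        return false;
    }
}
```

With the 3,501-token document, a 32-token branch prompt, and 128 tokens of headroom, the total is 3,661 and fits in 4,096. Swap in a 4,000-token prefix and the total is 4,160, which does not.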

Step 1 — DAG cycle detection

The orchestrator does a standard 3-color DFS on the dependency adjacency list:

// dfs lambda walks adjacency lists and throws when a back-edge indicates a cycle.
auto dfs = [&](auto self, const std::string & u) -> void {
    // Mark node u as currently on the recursion stack (visiting).
    state[u] = 1;
    // Explore all outgoing dependency edges from u to downstream nodes v.
    for (const auto & v : adj[u]) {
        // If v is visiting, we found a cycle u -> v and must abort pipeline setup.
        if (state[v] == 1) {
            // Throw with edge names so graph misconfiguration is easy to diagnose.
            throw std::runtime_error("Dependency cycle detected: " + u + " -> " + v);
        }
        // Recurse only when v has not been fully processed yet.
        if (state[v] == 0) {
            // Continue DFS from child node v.
            self(self, v);
        }
    }
    // All edges out of u are explored, so mark u as fully processed.
    state[u] = 2;
};

I know it’s boring, but trust me, it’s necessary. It is the algorithmic equivalent of checking your shoelaces before running. Skip it once and your pipeline will spend the rest of its short life waiting on itself. The error message includes the offending edge, so you can find the typo without grepping.
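For readers who want to poke at the algorithm outside the orchestrator, here is a self-contained sketch of the same 3-color DFS. The Graph alias, check_acyclic, and has_cycle are names I picked for the example; the repo's version lives inside the Orchestrator and may differ in detail.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

using Graph = std::map<std::string, std::vector<std::string>>;

// Throws std::runtime_error naming the offending edge if the graph has a cycle.
inline void check_acyclic(const Graph & adj) {
    // 0 = unvisited (white), 1 = on the recursion stack (gray), 2 = done (black).
    std::map<std::string, int> state;
    auto dfs = [&](auto self, const std::string & u) -> void {
        state[u] = 1;
        const auto it = adj.find(u);
        if (it != adj.end()) {
            for (const auto & v : it->second) {
                if (state[v] == 1) {
                    throw std::runtime_error("Dependency cycle detected: " + u + " -> " + v);
                }
                if (state[v] == 0) {
                    self(self, v);
                }
            }
        }
        state[u] = 2;
    };
    // Start a DFS from every node so disconnected subgraphs are checked too.
    for (const auto & entry : adj) {
        if (state[entry.first] == 0) {
            dfs(dfs, entry.first);
        }
    }
}

// Convenience wrapper that reports the outcome as a bool instead of throwing.
inline bool has_cycle(const Graph & adj) {
    try {
        check_acyclic(adj);
        return false;
    } catch (const std::runtime_error &) {
        return true;
    }
}
```

A prefill node with two leaf branches passes; two nodes that depend on each other do not.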

Step 2 — One std::async per node, gated on shared futures

worker_tasks.push_back(std::async(
            std::launch::async,
            [this, name, state, dependencies, &completion_promises, &completion_futures]() {
                // Read this node's watermark requirement once for dependency gating decisions.
                const int32_t req = nodes.at(name)->required_prefix_tokens();
                // Wait for each upstream dependency according to V2 watermark rules.
                for (const auto & dep_name : dependencies) {
                    // Resolve upstream node pointer for prefill provider detection.
                    ExecutionNode * dep = nodes.at(dep_name).get();
                    // If upstream is prefiller and this branch uses watermark gating, wait on watermark.
                    if (dep->is_prefill_provider() && req >= 0) {
                        // Block until PipelineState watermark >= required_prefix_tokens (speculative start).
                        state->wait_for_watermark(req);
                    } else {
                        // Otherwise preserve V1 behavior: wait until upstream node thread completes.
                        completion_futures.at(dep_name).wait();
                    }
                }
                // Build llama_context_params with orchestrator default n_ctx budget.
                llama_context_params params = llama_context_default_params();
                // Lift n_ctx to SwarmKV default pipeline context for multi-k token documents.
                params.n_ctx = kSwarmkvDefaultPipelineCtx;
                // Bundle model/pool/name into OrchestratorContext for node execute().
                OrchestratorContext ctx = {
                    this->memory_pool->get_model(),
                    params,
                    this->memory_pool,
                    name.c_str(),
                };
                // Run node logic and fulfill promise so dependents can proceed.
                try {
                    // Dispatch to PrefillNode or AnalyticalNode implementation.
                    nodes.at(name)->execute(state, &ctx);
                    // Signal successful completion to shared_future waiters.
                    completion_promises.at(name).set_value();
                } catch (...) {
                    if (req > 0) {
                        state->signal_milestone_consumed(req);
                    }
                    try {
                        completion_promises.at(name).set_exception(std::current_exception());
                    } catch (...) {
                    }
                    throw;
                }
            }));

One std::promise per node, with a std::shared_future so multiple downstream branches can wait on the same upstream completion without playing pass-the-future. The failure path always sets the exception, so dependents do not wait forever. We have all debugged the alternative, and oh boy did we not enjoy it!

Notice what is not in this loop: any logic about prefill, KV, or branches. The orchestrator does not know what a PrefillNode is. It knows about names, edges, and promises. The node-specific work lives in execute() and is fully polymorphic behind the ExecutionNode virtual interface. One responsibility per class, nothing overwhelming!
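The promise and shared_future gating is easier to see with the llama parts stripped out. The sketch below is an assumption-laden miniature: run_fanout and fanout_propagates_failure are my names, and the int carried by the future is a stand-in for "prefill finished at this token count".

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <vector>

// One upstream "prefill" result, n_branches downstream tasks gated on the same
// shared_future. Returns the sum of (prefix_len + branch index) over all branches.
inline int run_fanout(int n_branches) {
    std::promise<int> prefill_done;
    std::shared_future<int> prefix_len = prefill_done.get_future().share();

    std::vector<std::future<int>> branches;
    for (int b = 0; b < n_branches; ++b) {
        branches.push_back(std::async(std::launch::async, [prefix_len, b] {
            // get() blocks until the upstream value is set, or rethrows its exception.
            return prefix_len.get() + b;
        }));
    }

    prefill_done.set_value(3501);  // upstream completes; every waiter wakes up
    int sum = 0;
    for (auto & f : branches) {
        sum += f.get();
    }
    return sum;
}

// Shows the failure path: an upstream exception reaches the dependent branch.
inline bool fanout_propagates_failure() {
    std::promise<int> prefill_done;
    std::shared_future<int> prefix_len = prefill_done.get_future().share();
    std::future<int> branch =
        std::async(std::launch::async, [prefix_len] { return prefix_len.get(); });
    prefill_done.set_exception(
        std::make_exception_ptr(std::runtime_error("prefill failed")));
    try {
        branch.get();
    } catch (const std::runtime_error &) {
        return true;
    }
    return false;
}
```

Each branch holds its own copy of the shared_future, so nobody has to pass a single future around, and a failed upstream wakes the waiters with an exception instead of leaving them blocked.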

Step 3 — Prefill once, export KV

PrefillNode does four things in the following sequence:

  1. Read the document text from examples/base_doc.txt.
  2. Tokenize it (with the resize-on-negative-return llama idiom).
  3. Decode the tokens in chunks bounded by llama_n_batch(lctx), on sequence lane kSwarmkvPrefixSeqId, with absolute RoPE positions matching the absolute token index:
// Absolute RoPE position equals index in the full document token stream.
batch.pos[i] = cur + i;
// Each token belongs to exactly one sequence id list.
batch.n_seq_id[i] = 1;
// Bind all document tokens to the shared prefix sequence lane constant.
batch.seq_id[i][0] = kSwarmkvPrefixSeqId;
// Prefill requests no logits, so the flag stays 0 for every document token.
batch.logits[i] = 0;

  4. Export the prefix-sequence KV into the canonical host buffer and stamp the watermark for the branches:

KVHandoff::bind_contiguous_cache(lctx, state->materialized_branch_buffer);
// Mark prefill_complete so branches using kSwarmkvWaitForPrefillComplete can proceed.
state->mark_prefill_complete();

That is the entire point of the article in two lines. Everything else in this repo — the orchestrator, the LlamaGuard, the budget check, the documented no-op — exists to feed those two lines and to deliver their output to the branches in a single memcpy with no extra round trips.
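The chunking in item 3 is the part people tend to get subtly wrong, so here is the position bookkeeping by itself. This is a sketch with hypothetical names (Chunk, plan_prefill_chunks), and the batch size of 512 in the usage note is an assumption of mine, not a value from the repo.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One prefill chunk: where it starts in the document and how many tokens it holds.
struct Chunk {
    std::int32_t first_pos;  // absolute position of the chunk's first token
    std::int32_t n_tokens;   // number of tokens decoded in this chunk
};

// Splits n_tokens document tokens into chunks of at most n_batch tokens.
// Token i of a chunk gets absolute position first_pos + i.
inline std::vector<Chunk> plan_prefill_chunks(std::int32_t n_tokens, std::int32_t n_batch) {
    std::vector<Chunk> chunks;
    if (n_batch <= 0) {
        return chunks;  // nothing sensible to plan with a non-positive batch size
    }
    for (std::int32_t cur = 0; cur < n_tokens; cur += n_batch) {
        chunks.push_back({cur, std::min(n_batch, n_tokens - cur)});
    }
    return chunks;
}
```

For the 3,501-token document and a batch size of 512, this plans seven chunks, and the last one starts at position 3,072 with 429 tokens. Positions never reset between chunks, which is what lets the branches continue from prefix_seq_len later.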

Step 4 — Branch decode under LlamaGuard

1. Allocate a per-branch buffer sized for the same n_ctx as the prefix:

// Allocate a branch buffer sized for n_ctx so later decode has headroom in the same blob policy.
branch_buf = ctx->memory_pool->allocate_branch_cache(static_cast<uint32_t>(ctx->ctx_params.n_ctx));

2. Copy the canonical snapshot into the branch buffer, a.k.a., the famous “copy-on-fork”:

// Full-prefill path copies from canonical staging into the branch allocation.
KVHandoff::materialize_branch_cache(
    state->materialized_branch_buffer,
    branch_buf,
    fork_kv_bytes);

…which, when you click through, becomes

std::memcpy(dst_ptr, src_ptr, ncopy);

That is it. That is “copy-on-fork at the storage layer”, which is literally memcpy. It is the primitive. Everything fancy you have read about prefix sharing (RadixAttention’s reference counting, paged attention’s block table indirection) sits on top of the same idea: don’t recompute the prefix, reuse the bytes. Those systems share the bytes by reference; SwarmKV copies them.
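A bounds-checked version of that copy fits in a dozen lines. The sketch below uses std::vector as a stand-in for the pool buffers; materialize_branch is an illustrative name and is not the repo's KVHandoff::materialize_branch_cache.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Copies the first n_bytes of the shared snapshot into a branch-owned buffer.
// Throws instead of reading or writing out of bounds.
inline void materialize_branch(const std::vector<std::uint8_t> & snapshot,
                               std::vector<std::uint8_t> & branch,
                               std::size_t n_bytes) {
    if (n_bytes > snapshot.size() || n_bytes > branch.size()) {
        throw std::length_error("materialize_branch: copy exceeds a buffer");
    }
    if (n_bytes == 0) {
        return;  // nothing to copy; also avoids passing empty-buffer pointers to memcpy
    }
    std::memcpy(branch.data(), snapshot.data(), n_bytes);
}
```

After the copy, the branch owns its bytes: scribbling on the branch buffer leaves the canonical snapshot untouched, which is the property every later branch depends on.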

3. Spin up a fresh llama_context and restore the snapshot:

// Restore only the prefix sequence lane so branch decode stays isolated on seq 0.
const size_t n = llama_state_seq_set_data(
    lctx,
    static_cast<const uint8_t *>(base),
    fork_kv_bytes,
    kSwarmkvPrefixSeqId);
// Verify llama consumed exactly the number of bytes we copied into the branch buffer.
if (n != fork_kv_bytes) {
    // Free the context before throwing to avoid leaking VRAM on failure paths.
    llama_free(lctx);
    // Throw with a clear message so operators can debug size mismatches quickly.
    throw std::runtime_error("AnalyticalNode: llama_state_seq_set_data size mismatch.");
}

4. Build a single llama_batch for the short branch prompt with RoPE positions continuing from where the prefix ended:

for (int i = 0; i < batch.n_tokens; ++i) {
    // Copy the i-th branch token id into the batch slot.
    batch.token[i] = tokens[static_cast<size_t>(i)];
    // Place branch tokens immediately after the forked prefix positions for correct RoPE.
    batch.pos[i] = static_cast<llama_pos>(fork_prefix_len) + static_cast<llama_pos>(i);
    // Each token participates in exactly one sequence id list entry.
    batch.n_seq_id[i] = 1;
    // Bind all branch tokens to the shared prefix sequence lane constant.
    batch.seq_id[i][0] = kSwarmkvPrefixSeqId;
    // Disable logits inside the loop; only the last token of this branch step should have them enabled.
    batch.logits[i] = 0;
}

This is the bit everyone gets wrong the first time. If you forget the offset and start branch positions at zero, rotary embeddings silently go sideways and the model decodes from a position the prefix was never trained for. The symptom is confidently coherent nonsense. Welcome to the worst kind of debugging hell; please leave a tip on your way out.
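The offset rule is small enough to state as code. In this sketch, branch_positions and collides_with_prefix are hypothetical helpers of mine; they only model the position arithmetic, not the rotary math itself.

```cpp
#include <cstdint>
#include <vector>

// Positions for a branch prompt that continues right after the restored prefix.
inline std::vector<std::int32_t> branch_positions(std::int32_t prefix_len,
                                                  std::int32_t n_branch_tokens) {
    std::vector<std::int32_t> pos;
    for (std::int32_t i = 0; i < n_branch_tokens; ++i) {
        pos.push_back(prefix_len + i);
    }
    return pos;
}

// True when any position falls inside the range already used by the prefix,
// i.e. [0, prefix_len). That is the "forgot the offset" bug.
inline bool collides_with_prefix(const std::vector<std::int32_t> & pos,
                                 std::int32_t prefix_len) {
    for (const std::int32_t p : pos) {
        if (p >= 0 && p < prefix_len) {
            return true;
        }
    }
    return false;
}
```

With a 3,501-token prefix, a four-token branch prompt lands on positions 3501 through 3504. Start it at zero instead and every branch token claims a position the document already occupies.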

5. Finally, one llama_decode. Then write a diagnostic string into PipelineState::node_outputs, record timings_ms[name], and free the batch and the context.

Three lines of business logic per branch. Two contexts in flight at decode time. One memcpy each. One global lock. Everything else is plumbing.


5. The receipts (i.e., the numbers)

Now it is time to evaluate SwarmKV against the baseline and see whether all this hassle was worth it. All numbers come from examples/example-run-results/.

Quick note on methodology before anyone reaches for the rocks: every comparison below runs the same model (Qwen2.5-7B-Instruct-Q4_K_M.gguf), the same document (a deterministic 3,501-token synthetic doc generated by repeating “The quick brown fox jumps over the lazy dog. “ until the token target is hit — examples/base_doc.txt), the same GPU (GTX 1080, 8 GiB, Pascal sm_61), the same n_ctx=4096, and the same dtype. Baseline = two sequential llama_context instances, each prefilling the full document then decoding its branch prompt. SwarmKV = PrefillNode once + two AnalyticalNode branches over the snapshot. Workload type: prefill-dominated document analysis (RAG-style), not autoregressive chat. Three trials run back-to-back with a GPU-idle wait between them; the best is selected by 2·TTFT_pct + E2E_pct.

One metric definition before the table, because it matters: we use “Branch-2 activation latency (TTFT proxy)” — not the textbook serving-literature “request-arrival → first-output-token” TTFT. We mean the time the second branch spends in branch-specific work: its activation latency after the shared prefill is amortized across all branches. In a fan-out pipeline the cost the downstream consumer perceives is exactly this number, because the upstream prefill is paid once for the whole pipeline by design. The baseline value for this metric is the redundant document prefill that the second llama_context is forced to redo before it can answer; the SwarmKV value is the fork + restore + short-prompt decode.

Headline: GTX 1080, Qwen2.5-7B Q4_K_M, 3,501-token doc, two branches

Metric                                     Baseline (HF-style)   SwarmKV    Delta
End-to-end wall clock                      10,275 ms             5,272 ms   −48.69 % (~1.95×)
Branch-2 activation latency (TTFT proxy)   4,339 ms              83 ms      −98.09 % (~52.3×)
Baseline Agent-1 prefill                   4,346 ms              n/a        n/a
Baseline Agent-2 prefill                   4,339 ms              n/a        n/a
SwarmKV per-branch decode (avg)            n/a                   77 ms      n/a
Redundant prefill eliminated               n/a                   n/a        8,685 ms

Translation: the baseline spent 4,339 ms of the second agent’s perceived latency re-doing the dense attention pass it had just finished four seconds earlier on the same bytes. SwarmKV looks at that and says “what if we didn’t?” and ships an 83-millisecond answer. The cleanest single-number measurement of “how expensive was that prefill?” is just the ratio of those two timings; everything else in the branch is a rounding error.

Where the per-branch ~83 ms goes

The thesis of this whole article rests on one inequality: per-branch restore + decode is much, much cheaper than a redundant document prefill. The harness measures this directly at the aggregate level — the per-branch wall clock (allocate + copy + restore + decode, end to end) is 71–83 ms depending on which branch we look at, against a redundant prefill cost of ~4,339 ms. A ~52× ratio at the aggregate level is what makes everything else in this article work.

For more results and numbers, I suggest looking directly at the example run report.


6. “OK, but how is this different from vLLM / prefix caching / SGLang RadixAttention?”

A very reasonable question, and worth answering directly, because the inference-infra world has a lot of overlapping primitives and an HPC reader will ask this in the first comment.

  • vLLM / continuous batching / paged attention. Optimized for multi-tenant decode-time serving: many concurrent requests at different decode steps, scheduling the next token across them under streaming load. Headline primitive: paged attention. Unit of work: a streaming firehose of independent user prompts.
  • TGI / vLLM prefix caching. Excellent if your shared prefix is request-scoped or session-scoped. Not designed to expose KV snapshots as first-class objects you can hand to a different llama_context running a different downstream task in the same process.
  • SGLang RadixAttention. Tree-shaped prefix sharing inside a serving runtime — the closest cousin, but it is a server, not a single-process orchestration primitive.
  • llama.cpp’s own state save/restore. Exists, per-context. SwarmKV is the pipeline-level glue: a DAG, a host-buffer arena sized by the engine itself, a memcpy fan-out, a LlamaGuard policy, and a documented no-op patiently awaiting an upstream bind API.

7. So… how do I actually try it?

Well, I already posted the GitHub link at the start of the article. If you have made it this far down, please work hard one more time and scroll back up to the top.

Artifacts land under examples/example-run-results/: best_run.json, all_trials.csv, plots/*.png, and a narrative final_result.docx that walks through methodology and limitations.

Requirements: Linux, CUDA toolkit, an NVIDIA GPU (Pascal or newer; consumer or datacenter both work), a GGUF model that fits in your VRAM, and the patience to read a CMake file once.


8. Plot twist — this is just SIB broadcast in a transformer costume

I should probably confess at this point: I am not a “GPU person” by training. I came up through telecom — 5G NR with a foot creeping firmly into 6G research — and I started looking at LLM inference infrastructure because every problem in this codebase felt strangely familiar.

One-sentence decoder ring for readers without a 3GPP background: in a 5G network, the cell tower does not unicast network configuration to every phone separately — it broadcasts a small set of System Information Blocks (SIB1, SIB2, …) once on a shared channel, every phone in range reads the same broadcast, and per-user data rides on top of that shared context on a dedicated channel. The acronyms in the table below — MIB (Master Information Block, the very first thing every phone reads), PBCH and PDSCH (the shared broadcast and downlink data channels), HARQ (the receiver’s “keep what we already decoded, only re-send what was missing” retransmission protocol), and RNTI (the temporary ID that distinguishes one phone’s traffic from another’s) — are just names for the channels and identifiers that separate shared, computed once from unique per consumer. That distinction is the whole analogy.

Look at this side-by-side and tell me with a straight face these are different problems:

5G NR cell broadcast (at the gNB)                                   SwarmKV (at the GPU)
One MIB on PBCH per SS burst                                        One shared document tokenized once
Repeated SIBs (SIB1, SIB2, …) on PDSCH                              Serialized KV snapshot in MemoryPool
Every camped UE in the cell reads the same SI                       Every analytical branch reads the same snapshot
UE-specific dedicated PDSCH for unicast user data                   Per-branch llama_context decoding the branch prompt
RNTI per UE distinguishes unicast streams                           Per-branch buffer + sequence id distinguishes branch state
HARQ soft-buffer retained across retransmissions                    KV snapshot retained across branches
Skip broadcast → every UE forces unicast SI → air interface melts   Skip snapshot → every branch re-prefills the doc → GPU melts

A quick aside to two very different audiences

To my HPC and CUDA-first friends reading this: I know. KV reuse is not a new idea. vLLM has prefix caching, SGLang has RadixAttention, llama.cpp itself exposes state save/restore. SwarmKV’s contribution is not the primitive; it is the single-process orchestration shape — a tiny C++ DAG runtime that exposes “prefill once, fan out N branches” as a first-class operation, sized for one 8 GiB consumer GPU, with the safety rails (LlamaGuard, swarmkv_validate_context_budget, the documented bind no-op) that a researcher actually needs to ship a demo on a Tuesday. Please put the pitchforks down.

To my telecom friends: if “KV cache” sounded like a foreign language until ten minutes ago, you are not behind — you are early. For twenty years our world was FPGAs, ASICs, and PRBs. We optimized spectrum, not silicon. Then AI-RAN, NWDAF, NVIDIA Aerial, the AI-RAN Alliance, and the 3GPP Rel-20 study items all happened in roughly the same eighteen months, and the next decade of telecom careers now demands being bilingual between spectrum-world and GPU-world. The intuition translates cleanly. You have been fanning out shared computation to many consumers since the first CRS pilot. Same animal, just a new zoo.


9. Honest caveats (because the comments are coming)

If you came here to find what is wrong with the project — congratulations, the project found its first reader. From the limitations section of final_result.docx and the inline comments in the source:

  1. KV staging is host-side. MemoryPool allocates ggml_backend_buffer_t from the CPU device (ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_CPU)). Branch decode still runs on the GPU; only the snapshot transit is host-staged via llama_state_get_data → memcpy → llama_state_seq_set_data. A device-aware materialize lives on the roadmap, blocked on the same upstream KV bind API that bind_contiguous_cache is waiting for.
  2. Shared decode mutex (under the pinned upstream revision). LlamaGuard serializes every llama_* call from worker threads. Under the llama.cpp revision and GPU configuration used in this project, concurrent decode from multiple threads on a single GPU was not reliably safe — the exact behaviour depends on backend, version, and graph scheduling, but in our setup the conservative choice was a global lock. The DAG-level concurrency is real, but per-request GPU compute remains sequential. This is the single biggest performance limitation in V1, and it is exactly where Part 2 of this series picks up.
  3. SwarmKV_Prefill_Ms reports 0. This is a known instrumentation gap in how OrchestratorContext::node_name is consumed inside PrefillNode. The prefill ran (you see its cost in End_To_End_Ms and in the derived effective shared-prefill cost); it is just not being keyed correctly into timings_ms. The effective shared prefill is calculated as SwarmKV_End_To_End_Ms − max(SwarmKV_AgentA_Ms, SwarmKV_AgentB_Ms) ≈ 5,189 ms. Reporting bug, not correctness bug. Logged.
  4. Synthetic document. The benchmark builds a deterministic 3,501-token document by repeating “The quick brown fox jumps over the lazy dog. ” until the token target is hit. This isolates the performance signal from content effects and keeps trials reproducible bit-for-bit. Real documents will produce noisier per-trial absolute timings; the structural ratios should hold, because prefill cost is driven by token count rather than by what the tokens say.
  5. Single GPU class. All numbers in the report come from one Pascal-class GTX 1080. Newer GPUs (Ada, Hopper) prefill much faster, so the absolute ms numbers will shrink, but the structural gap between full-prefill cost and short-decode cost (which is what SwarmKV exploits) is expected to remain. It has not been measured on newer hardware here.
  6. bind_contiguous_cache is a documented no-op. Yes, still. Until upstream lands a stable external-KV attachment API, the function validates its arguments, casts them to void, and goes home.
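
Since caveat 3 asks you to trust a derived number, here is the arithmetic as a minimal sketch. The struct and function names are hypothetical (they only mirror the metric names quoted above, not the project's actual types), and the formula is exactly the one stated in the caveat: end-to-end time minus the slower of the two branches.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container: field names mirror the reported metric names,
// but this layout is an assumption, not the project's real type.
struct SwarmKvTimings {
    double end_to_end_ms;  // SwarmKV_End_To_End_Ms
    double agent_a_ms;     // SwarmKV_AgentA_Ms
    double agent_b_ms;     // SwarmKV_AgentB_Ms
};

// Effective shared-prefill cost, derived because the direct
// SwarmKV_Prefill_Ms timer currently reports 0.
inline double effective_shared_prefill_ms(const SwarmKvTimings& t) {
    const double slowest_branch = std::max(t.agent_a_ms, t.agent_b_ms);
    // The derivation only makes sense if the run covers its slowest branch.
    assert(t.end_to_end_ms >= slowest_branch);
    return t.end_to_end_ms - slowest_branch;
}
```

With made-up round numbers (1000 ms end to end, branches at 300 ms and 400 ms) the function returns 600 ms. Those inputs are illustrative only; the measured run is the one that lands at roughly 5,189 ms.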

Everything on this list is on the roadmap, and none of it changes the headline result. The point of putting it in writing is that you should not have to dig for it: the moment a benchmark blog post hides its caveats is the moment its numbers stop being trustworthy.


10. The V1 ceiling (and the setup for Part 2)

SwarmKV proves that you can stop re-prefilling. But if you reread caveat #2, you have already spotted the next ceiling: the GPU compute itself is still serialized.

Here is what actually happens on the wall clock. The DAG-level concurrency is genuine — branches are real std::async workers with real dependency gating. But every branch’s llama_decode runs inside LlamaGuard, a single global mutex. So while the orchestration fans out, the GPU work lines up single file. Two branches take turns. Fifty branches take fifty turns. The GPU is never actually shared; it is time-multiplexed by hand, one lock at a time, with no fairness guarantee and no way to measure who is starving whom.
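
To make "fans out, then lines up single file" concrete, here is a small self-contained sketch of the pattern. It is not SwarmKV's code: g_decode_mutex is a hypothetical stand-in for the role LlamaGuard plays, and the guarded section just bumps a counter where a real worker would call llama_decode. The function launches real std::async workers and reports the largest number of them ever observed inside the guarded section at the same time.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the global lock described above.
static std::mutex g_decode_mutex;

// Launches `branches` concurrent workers; each performs `steps_per_branch`
// guarded "decode" steps. Returns the peak number of workers that were
// inside the guarded section simultaneously.
int max_concurrent_decodes(int branches, int steps_per_branch) {
    assert(branches > 0 && steps_per_branch > 0);
    int inside = 0;      // only touched while holding g_decode_mutex
    int max_inside = 0;  // only touched while holding g_decode_mutex
    std::vector<std::future<void>> workers;
    for (int b = 0; b < branches; ++b) {
        workers.push_back(std::async(std::launch::async, [&] {
            for (int s = 0; s < steps_per_branch; ++s) {
                std::lock_guard<std::mutex> guard(g_decode_mutex);
                ++inside;  // a real worker would run its decode step here
                if (inside > max_inside) max_inside = inside;
                --inside;
            }
        }));
    }
    for (auto& w : workers) w.get();  // join all workers before returning
    return max_inside;
}
```

Whether you ask for 2 branches or 50, the peak is 1: the threads are concurrent, the guarded work is not. Note also that std::mutex makes no promise about which waiting thread gets the lock next, which is the "no fairness guarantee" part.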

That is fine for a two-agent demo. It falls apart the moment you run the workload SwarmKV is actually built for: 50 specialized micro-agents competing for one GPU. At that scale you stop caring about “did we avoid re-prefill” and start caring about questions a hand-rolled mutex cannot answer:

  • When 50 agents want the GPU at once, who goes first, and how do we make it fair?
  • What is the p50, p95, and p99 latency each agent sees while sharing one card?
  • How much jitter does contention add, and where does throughput collapse?
  • How do we slice GPU compute cycles on purpose instead of by accident?
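
For readers who have not computed tail latencies by hand, here is a minimal sketch of what "p50, p95, p99" means, using the nearest-rank definition. This is my assumption about a reasonable definition, not a preview of Part 2's profiler, and percentile_ms is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. The vector is taken by value
// so the caller's data is not reordered.
double percentile_ms(std::vector<double> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    std::size_t rank = static_cast<std::size_t>(std::ceil(p / 100.0 * n));
    if (rank < 1) rank = 1;  // defensive clamp for very small p
    return samples[rank - 1];
}
```

With ten per-agent latencies of 10, 20, …, 100 ms, this gives p50 = 50 ms while p95 and p99 both collapse to 100 ms. That collapse is the practical lesson: tail percentiles say very little until there are many samples per agent.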

That is Part 2 of this series: Time-Slicing the GPU for Concurrent Agent Swarms. Toy agents run sequentially in Python. Production agents run concurrently on bare metal, and managing VRAM and compute when many micro-agents share one NVIDIA GPU is its own discipline. Part 2 builds a Kubernetes-level time-slice profiler that dynamically allocates compute cycles and measures p50/p95/p99 latency, jitter, and throughput proxies when agentic inference workloads share a GPU via the Kubernetes Device Plugin with CUDA time-slicing. The global mutex in SwarmKV is exactly the thing it replaces with something measurable.

(For the curious: there is a separate, orthogonal V1 limitation worth a future SwarmKV V2 post — the pipeline currently waits for the entire prefill to finish before any branch starts, even when a branch only needs the first 500 tokens of context. Letting branches start the instant their required prefix slice is materialized is a real win, but it is its own story and its own benchmark. It is not Part 2. Part 2 is about sharing the GPU across many agents; that prefill-streaming idea is a follow-up.)

See you in Part 2.


Disclaimer: The illustrations in this article (the hero banner, the architecture diagram, the telecom-vs-SwarmKV split panel, and the GPU time-slicing image) were generated using AI (Claude Opus 4.8). They are illustrative, not photographic, and any labels visible inside the images are stylized rather than authoritative — refer to the article body and the code itself for precise function names, metric values, and architecture details.
