Production-Ready LLM Agents: A Comprehensive Framework for Offline Evaluation



Introduction & Context

A well-funded AI team demoed their multi-agent financial assistant to the executive committee. The system was impressive — routing queries intelligently, pulling relevant documents, generating articulate responses. Heads nodded. Budgets were approved. Then someone asked: “How do we know it’s ready for production?” The room went quiet.

This scene plays out frequently across the industry. We’ve become remarkably good at building sophisticated agent systems, but we haven’t developed the same rigor around proving they work. When I ask teams how they validate their agents before deployment, I typically hear some combination of “we tested it manually,” “the demo went well,” and “we’ll monitor it in production.” None of these are wrong, but none of them constitute a quality gate that governance can sign off on or that engineering can automate.

The Problem: Evaluating Non-deterministic Multi-Agent Systems

The challenge isn’t that teams don’t care about quality — they do. The challenge is that evaluating LLM-based systems is genuinely hard, and multi-agent architectures make it harder.

Traditional software testing assumes determinism. Given input X, we expect output Y, and we write an assertion to validate. But ask an LLM the same question twice and you’ll get different phrasings, different structures, sometimes different emphasis. Both responses might be correct. Or one might be subtly wrong in ways that aren’t obvious without domain expertise. The assertion-based mental model breaks down.

Now multiply this complexity across a multi-agent system. A router agent decides which specialist handles the query. That specialist might retrieve documents from a knowledge base. The retrieved context shapes the generated response. A failure anywhere in this chain degrades the output, but diagnosing where things went wrong requires evaluating each component.

I’ve observed that teams need answers to three distinct questions before they can confidently deploy:

  1. Is the router doing its job? When a user asks a simple question, does it go to the fast, cheap agent? When they ask something complex, does it route to the agent with deeper capabilities? Getting this wrong has real consequences — either you’re wasting money and time on over-engineered responses, or you’re giving users shallow answers to questions that deserve depth.
  2. Are the responses actually good? This sounds obvious, but “good” has multiple dimensions. Is the information accurate? If the agent is doing analysis, is the reasoning sound? If it’s generating a report, is it complete? Different query types need different quality criteria.
  3. For agents using retrieval, is the RAG pipeline working? Did we pull the right documents? Did the agent actually use them, or did it hallucinate information that sounds plausible but isn’t grounded in the retrieved context?

Offline vs Online: A Brief Distinction

Before diving into the framework, I want to clarify what I mean by “offline evaluation” because the terminology can be confusing.

Offline evaluation happens before deployment, against a curated dataset where you know the expected outcomes. You’re testing in a controlled environment with no user impact. This is your quality gate — the checkpoint that determines whether a model version is ready for production.

Online evaluation happens after deployment, against live traffic. You’re monitoring real user interactions, sampling responses for quality checks, detecting drift. This is your safety net — the ongoing assurance that production behavior matches expectations.

Both matter, but they serve different purposes. This article focuses on offline evaluation because that’s where I see the biggest gap in current practice. Teams often jump straight to “we’ll monitor it in production” without establishing what “good” looks like beforehand. That’s backwards. You need offline evaluation to define your quality baseline before online evaluation can tell you whether you’re maintaining it.

Article Roadmap

Here, I present a framework I’ve developed and refined across multiple agent deployments. I’ll walk through a reference architecture that illustrates common evaluation challenges, then introduce what I call the Three Pillars of offline evaluation — routing, LLM-as-judge, and RAG evaluation. For each pillar, I’ll explain not just what to measure but why it matters and how to interpret the results. Finally, I’ll cover how to operationalize with automation (CI/CD) and connect it to governance requirements.

The System under Evaluation

Reference Architecture

To make this concrete, I’ll use an example that is becoming increasingly common: a financial services company modernizing the tools and services that support its advisors, who in turn serve end customers. One of those applications is a financial research assistant with capabilities to look up financial instruments, perform various kinds of analysis, and conduct detailed research.

Multi-Agent system – Financial Research Assistant: image by author

This is architected as a multi-agent system in which different agents use different models based on task need and complexity. The router agent sits at the front, classifying incoming queries by complexity and directing them appropriately. Done well, this optimizes both cost and user experience. Done poorly, it creates frustrating mismatches — users waiting for simple answers, or getting superficial responses to complex questions.

Evaluation Challenges

This architecture is elegant in theory but creates evaluation challenges in practice. Different agents need different evaluation criteria, and this isn’t always obvious upfront.

  • The simple agent needs to be fast and factually accurate, but nobody expects it to provide deep reasoning.
  • The analysis agent needs to demonstrate sound logic, not just accurate facts.
  • The research agent needs to be comprehensive — missing a major risk factor in an investment analysis is a failure even if everything else is correct.
  • Then there’s the RAG dimension. For the agents that retrieve documents, you have a whole separate set of questions. Did we retrieve the right documents? Did the agent actually use them? Or did it ignore the retrieved context and generate something plausible-sounding but ungrounded?

Evaluating this system requires evaluating multiple components with different criteria. Let’s see how we approach this.

Three Pillars of Offline Evaluation

Framework Overview

Over the past two years, working across various agent implementations, I’ve converged on a framework with three evaluation pillars. Each addresses a distinct failure mode, and together they provide reasonable coverage of what can go wrong.

Offline Evaluation Framework: image by author

The pillars aren’t independent. Routing affects which agent handles the query, which affects whether RAG is involved, which affects what evaluation criteria apply. But separating them analytically helps you diagnose where problems originate rather than just observing that something went wrong.

One important principle: not every evaluation runs on every query. Running comprehensive RAG evaluation on a simple price lookup is wasteful — there’s no RAG to evaluate. Running only factual accuracy checks on a complex research report misses whether the reasoning was sound or the coverage was complete.
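This selective application can be made explicit in code. The sketch below is illustrative, not from any library: the tier names and evaluation labels are assumptions that mirror the reference architecture described in this article.

```python
# Sketch: map query complexity to the evaluations that should run on it.
# Tier names and evaluation labels are illustrative, not from a library.

EVALUATIONS_BY_COMPLEXITY = {
    "simple": {"routing", "factual_accuracy"},
    "medium": {"routing", "factual_accuracy", "reasoning_quality"},
    "high":   {"routing", "factual_accuracy", "reasoning_quality", "completeness"},
}

def select_evaluations(complexity: str, uses_rag: bool) -> set:
    """Return the set of evaluations to run for one sample."""
    evals = set(EVALUATIONS_BY_COMPLEXITY[complexity])
    if uses_rag:
        # RAG metrics only apply when the agent actually retrieved documents.
        evals |= {"context_precision", "context_recall", "faithfulness"}
    return evals

print(sorted(select_evaluations("simple", uses_rag=False)))  # ['factual_accuracy', 'routing']
```

A simple price lookup gets only routing and factual checks; a RAG-backed research query gets the full battery.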

Pillar 1: Routing Evaluation

Routing evaluation answers what seems like a simple question: did the router pick the right agent? In practice, getting this right is trickier than it appears, and getting it wrong has cascading consequences.

I think about routing failures in two categories. Under-routing happens when a complex query goes to a simple agent. The user asks for a comparative analysis and gets back a superficial response that doesn’t address the nuances of their question. They’re frustrated, and rightfully so — the system had the capability to help them but didn’t deploy it.

Over-routing is the opposite: simple queries going to complex agents. The user asks for a stock price and waits fifteen seconds while the research agent spins up, retrieves documents it doesn’t need, and generates an elaborate response to a question that deserved three words. The answer is probably fine, but you’ve wasted compute, money, and the user’s time.

In one engagement, we discovered that the router was over-routing about 40% of simple queries. The responses were good, so nobody had complained, but the system was spending five times what it should have on those queries. Fixing the router’s classification logic cut costs significantly without any degradation in user-perceived quality.

Router evaluation approaches: image by author

For evaluation, I use two approaches depending on the situation. Deterministic evaluation: Create a test dataset where each query is labeled with the expected agent, and measure what percentage of routing decisions are correct. This is fast, cheap, and gives a clear accuracy number.

LLM-based evaluation: Adds nuance for ambiguous cases. Some queries genuinely could go either way — “Tell me about Microsoft’s business” could be a simple overview or a deep analysis depending on what the user actually wants. When the router’s choice differs from your label, an LLM judge can assess whether the choice was reasonable even if it wasn’t what you expected. This is more expensive but helps you distinguish true errors from judgment calls.

The metrics I track include overall routing accuracy, which is the headline number, but also a confusion matrix showing which agents get confused with which. If the router consistently sends analysis queries to the research agent, that’s a specific calibration issue you can address. I also track over-routing and under-routing rates separately because they have different business impacts and different fixes.
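These metrics are straightforward to compute once you have labeled expected/actual agent pairs. The sketch below is a minimal version under one assumption I’m making explicit: agents can be ranked by complexity, so a mismatch can be classified as over- or under-routing. Agent names mirror the reference architecture and are illustrative.

```python
from collections import Counter

# Complexity rank per agent lets us distinguish over- vs under-routing.
# Agent names are illustrative, mirroring the reference architecture.
AGENT_RANK = {"simple_agent": 0, "analysis_agent": 1, "research_agent": 2}

def routing_metrics(samples):
    """samples: list of (expected_agent, actual_agent) pairs."""
    correct = over = under = 0
    confusion = Counter()
    for expected, actual in samples:
        confusion[(expected, actual)] += 1
        if actual == expected:
            correct += 1
        elif AGENT_RANK[actual] > AGENT_RANK[expected]:
            over += 1   # simple query sent to a heavier agent
        else:
            under += 1  # complex query sent to a lighter agent
    n = len(samples)
    return {
        "accuracy": correct / n,
        "over_routing_rate": over / n,
        "under_routing_rate": under / n,
        "confusion": dict(confusion),  # (expected, actual) -> count
    }

results = routing_metrics([
    ("simple_agent", "simple_agent"),
    ("simple_agent", "research_agent"),   # over-routed
    ("analysis_agent", "simple_agent"),   # under-routed
    ("research_agent", "research_agent"),
])
print(results["accuracy"])  # 0.5
```

The confusion counter is what surfaces systematic miscalibration, such as analysis queries consistently landing on the research agent.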

Pillar 2: LLM-as-Judge Evaluation

The challenge with evaluating LLM outputs is that they are not deterministic, so they cannot be matched against an expected answer. Valid responses vary in phrasing, structure, and emphasis. You need evaluation that understands semantic equivalence, assesses reasoning quality, and catches subtle factual errors. Human evaluation does this well but doesn’t scale. It is not feasible to have someone manually review thousands of test cases on every deployment.

LLM-as-judge addresses this by using a capable language model to evaluate other models’ outputs. You provide the judge with the query, the response, your evaluation criteria, and any ground truth you have, and it returns a structured assessment. The approach has been validated in research showing strong correlation with human judgments when the evaluation criteria are well-specified.

A few practical notes before diving into the dimensions. Your judge model should be at least as capable as the models you’re evaluating — I typically use Claude Sonnet or GPT-4 for judging. Using a weaker model as judge leads to unreliable assessments. Also, judge prompts need to be specific and structured. Vague instructions like “rate the quality” produce inconsistent results. Detailed rubrics with clear scoring criteria produce usable evaluations.

I evaluate three dimensions, applied selectively based on query complexity.

LLM-as-judge evaluation metrics: image by author

Factual accuracy is foundational. The judge extracts factual claims from the response and verifies each against your ground truth. For a financial query, this might mean checking that the P/E ratio cited is correct, that the revenue figure is accurate, that the growth rate matches reality. The output is an accuracy score plus a breakdown of which facts were correct, incorrect, or missing.

This applies to all queries regardless of complexity. Even simple lookups need factual verification — arguably especially simple lookups, since users trust straightforward factual responses and errors undermine that trust.

Reasoning quality matters for analytical responses. When the agent is comparing investment options or assessing risk, you need to evaluate not just whether the facts are right but whether the logic is sound. Does the conclusion follow from the premises? Are claims supported by evidence? Are assumptions made explicit? Does the response acknowledge uncertainty appropriately?

I only run reasoning evaluation on medium and high complexity queries. Simple factual lookups don’t involve reasoning — there’s nothing to evaluate. But for anything analytical, reasoning quality is often more important than factual accuracy. A response can cite correct numbers but draw invalid conclusions from them, and that’s a serious failure.

Completeness applies to comprehensive outputs like research reports. When a user asks for an investment analysis, they expect coverage of certain elements: financial performance, competitive position, risk factors, growth catalysts. Missing a major element is a failure even if everything included is accurate and well-reasoned.

LLM-as-judge evaluation scores: image by author

I run completeness evaluation only on high complexity queries where comprehensive coverage is expected. For simpler queries, completeness isn’t meaningful — you don’t expect a stock price lookup to cover risk factors.

The judge prompt structure matters more than people realize. I always include the original query (so the judge understands context), the response being evaluated, the ground truth or evaluation criteria, a specific rubric explaining how to score each dimension, and a required output format (I use JSON for parseability). Investing time in prompt engineering for your judges pays off in evaluation reliability.
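To make the structure concrete, here is a minimal sketch of a judge prompt and verdict parsing. `call_judge_model` is a placeholder for your actual LLM client call (for example, to Claude or GPT-4); the rubric wording and JSON schema are illustrative assumptions, and a real run would hit an LLM API rather than the stub shown.

```python
import json

# Judge prompt: query for context, response under test, ground truth,
# a scoring rubric, and a required JSON output format for parseability.
JUDGE_TEMPLATE = """You are evaluating an AI assistant's response.

Query: {query}
Response: {response}
Ground truth facts: {facts}

Rubric:
- factual_accuracy: fraction of response claims that match the ground truth (0.0-1.0)
- List each ground-truth fact as correct, incorrect, or missing.

Return ONLY JSON: {{"factual_accuracy": <float>, "fact_breakdown": []}}"""

def evaluate_response(query, response, facts, call_judge_model):
    prompt = JUDGE_TEMPLATE.format(query=query, response=response, facts=facts)
    raw = call_judge_model(prompt)  # the judge model's text output
    return json.loads(raw)          # structured, parseable verdict

# Stubbed judge for illustration only.
stub = lambda prompt: '{"factual_accuracy": 0.8, "fact_breakdown": []}'
verdict = evaluate_response("q", "r", ["f1"], stub)
print(verdict["factual_accuracy"])  # 0.8
```

In practice you would also handle judges that wrap their JSON in prose or code fences; requiring “ONLY JSON” in the prompt reduces, but does not eliminate, that failure mode.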

Pillar 3: RAG Evaluation

RAG evaluation addresses a failure mode that’s invisible if you only look at final outputs: the system generating plausible-sounding responses that aren’t actually grounded in retrieved knowledge.

The RAG pipeline has two stages, and either can fail. Retrieval failure means the system didn’t pull the right documents — either it retrieved irrelevant content or it missed documents that were relevant. Generation failure means the system retrieved good documents but didn’t use them properly, either ignoring them entirely or hallucinating information not present in the context.

Standard response evaluation conflates these failures. If the final answer is wrong, you don’t know whether retrieval failed or generation failed. RAG-specific evaluation separates the concerns so you can diagnose and fix the actual problem.

I use the RAGAS (Retrieval Augmented Generation Assessment) framework for this, which provides standardized metrics that have become industry standard. The metrics fall into two groups.

RAG evaluation metrics: image by author

Retrieval quality metrics assess whether the right documents were retrieved. Context precision measures what fraction of retrieved documents were actually relevant — if you retrieved four documents and only two were useful, that’s 50% precision. You’re pulling noise. Context recall measures what fraction of relevant documents were retrieved — if three documents were relevant and you only got two, that’s 67% recall. You’re missing information.
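The arithmetic above can be sketched as simple set operations over document IDs. Note this is a deliberate simplification: RAGAS’s actual implementations are rank-aware and LLM-assisted. The document IDs beyond the two from the article’s sample are hypothetical, and the edge-case return values are my own conventions.

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved documents that are actually relevant."""
    if not retrieved:
        return 0.0  # convention: nothing retrieved, nothing useful
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 1.0  # convention: nothing to find counts as full recall
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["MSFT_10K_2024", "GOOGL_10K_2024", "AAPL_10K_2024", "news_blob"]
relevant  = ["MSFT_10K_2024", "GOOGL_10K_2024", "MSFT_earnings_call"]
print(context_precision(retrieved, relevant))           # 0.5  (2 of 4 useful)
print(round(context_recall(retrieved, relevant), 2))    # 0.67 (2 of 3 found)
```

Low precision means you’re pulling noise into the context window; low recall means answers are missing information no matter how well generation behaves.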

Generation quality metrics assess whether retrieved context was used properly. Faithfulness is the critical one: it measures whether claims in the response are supported by the retrieved context. If the response makes five claims and four are grounded in the retrieved documents, that’s 80% faithfulness. The fifth claim is either from the model’s parametric knowledge or hallucinated — either way, it’s not grounded in your retrieval, which is a problem if you’re relying on RAG for accuracy.

RAG evaluation scores: image by author

I want to emphasize faithfulness because it’s the metric most directly tied to hallucination risk in RAG systems. A response can sound authoritative and be completely fabricated. Faithfulness evaluation catches this by checking whether each claim traces back to retrieved content.
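The claim-grounding idea can be sketched as follows. In RAGAS, the per-claim support check is itself an LLM call; here `is_supported` is stubbed with naive substring matching purely for illustration, and the claims and context are made-up examples.

```python
def faithfulness(claims, context, is_supported):
    """Fraction of response claims grounded in the retrieved context."""
    if not claims:
        return 1.0  # convention: no claims, nothing ungrounded
    supported = sum(1 for c in claims if is_supported(c, context))
    return supported / len(claims)

context = "Microsoft's P/E is 35. Google's P/E is 25."
claims = [
    "Microsoft's P/E is 35.",
    "Google's P/E is 25.",
    "Microsoft will outperform next quarter.",  # not in the context
]

# Stub: a real implementation would ask an LLM whether the context
# entails the claim, not check for a literal substring.
naive_check = lambda claim, ctx: claim in ctx

print(round(faithfulness(claims, context, naive_check), 2))  # 0.67
```

The unsupported third claim is exactly the kind of plausible-sounding, ungrounded statement that faithfulness evaluation is designed to catch.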

In one project, we found that faithfulness scores varied dramatically by query type. For straightforward factual queries, faithfulness was above 90%. For complex analytical queries, it dropped to around 60% — the model was doing more “reasoning” that went beyond the retrieved context. That’s not necessarily wrong, but it meant users couldn’t trust that analytical conclusions were grounded in the source documents. We ended up adjusting the prompts to more explicitly constrain the model to retrieved information for certain query types.

Implementation & Integration

Pipeline Architecture

The evaluation pipeline has four stages: load the dataset, execute the agent on each sample, run the appropriate evaluations, and aggregate into a report.

Offline evaluation pipeline: image by author

We start with the sample dataset to be evaluated. Each sample needs the query itself, metadata indicating complexity level and expected agent, ground truth facts for accuracy evaluation, and, for RAG queries, the relevant documents that should be retrieved. Building this dataset is tedious work, but the quality of your evaluation depends entirely on the quality of your ground truth. See the example below (JSON):

{
  "id": "eval_001",
  "query": "Compare Microsoft and Google's P/E ratios",
  "category": "comparison",
  "complexity": "medium",
  "expected_agent": "analysis_agent",
  "ground_truth_facts": [
    "Microsoft P/E is approximately 35",
    "Google P/E is approximately 25"
  ],
  "ground_truth_answer": "Microsoft trades at higher P/E (~35) than Google (~25)...",
  "relevant_documents": ["MSFT_10K_2024", "GOOGL_10K_2024"]
}

I recommend starting with at least 50 samples per complexity level, so 150 minimum for a three-tier system. More is better — 400 total gives you better statistical confidence in the metrics. Stratify across query categories so you’re not accidentally over-indexing on one type.
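Because evaluation quality hinges on dataset quality, it’s worth validating the dataset itself before the first run. A minimal sketch, assuming the sample schema shown above (field names and the per-tier minimum are taken from this article; the helper itself is illustrative):

```python
# Fields every sample must carry, per the schema shown above.
REQUIRED_FIELDS = {"id", "query", "complexity", "expected_agent", "ground_truth_facts"}

def validate_dataset(samples, min_per_tier=50):
    """Check required fields and per-complexity coverage; return problems found."""
    problems = []
    tier_counts = {}
    for s in samples:
        missing = REQUIRED_FIELDS - s.keys()
        if missing:
            problems.append(f"{s.get('id', '?')}: missing {sorted(missing)}")
        tier = s.get("complexity")
        tier_counts[tier] = tier_counts.get(tier, 0) + 1
    for tier, n in tier_counts.items():
        if n < min_per_tier:
            problems.append(f"tier '{tier}' has {n} samples, need {min_per_tier}")
    return problems

samples = [
    {"id": "eval_001", "query": "q", "complexity": "simple",
     "expected_agent": "simple_agent", "ground_truth_facts": []},
    {"id": "eval_002", "query": "q2", "complexity": "medium"},  # fields missing
]
issues = validate_dataset(samples, min_per_tier=1)
print(issues)  # flags eval_002's missing fields
```

Running a check like this on every dataset change catches the silent failure mode where an evaluation “passes” because half its samples were malformed and skipped.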

For observability, I use Langfuse, which provides trace storage, score attachment, and dataset run tracking. Each evaluation sample creates a trace, and each evaluation metric attaches as a score to that trace. Over time, you build a history of evaluation runs that you can compare across model versions, prompt changes, or architecture modifications. The ability to drill into specific failures and see the full trace is very helpful for troubleshooting.

Automated (CI/CD) Quality Gates

Evaluation becomes far more powerful when it’s automated and blocking. Scheduled execution of the evaluation against a representative dataset subset is a good start. The run produces metrics; if any metric falls below its defined threshold, the downstream governance mechanism kicks in: quality reviews, failed gate checks, and so on.

The thresholds need to be calibrated to your use case and risk tolerance. For a financial application where accuracy is critical, I might set factual accuracy at 90% and faithfulness at 85%. For an internal productivity tool with lower stakes, 80% and 75% might be acceptable. The key is aligning the thresholds with governance and quality teams and applying them in a standard repeatable way.
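The gate check itself is simple once thresholds are agreed. A sketch, using the article’s example thresholds for a financial application (the function and return shape are illustrative, not from a CI system):

```python
# Thresholds agreed with governance; values are the article's financial-app examples.
THRESHOLDS = {"factual_accuracy": 0.90, "faithfulness": 0.85}

def quality_gate(metrics, thresholds=THRESHOLDS):
    """Return (passed, failures) for a blocking CI/CD gate check.

    failures maps each missed metric to (observed, required_minimum).
    Missing metrics are treated as 0.0, i.e. an automatic failure.
    """
    failures = {
        name: (metrics.get(name, 0.0), minimum)
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    }
    return (not failures), failures

passed, failures = quality_gate({"factual_accuracy": 0.93, "faithfulness": 0.81})
print(passed)    # False
print(failures)  # {'faithfulness': (0.81, 0.85)}
```

In a pipeline, a `False` result exits nonzero and blocks the deployment; the `failures` dict feeds directly into the failure report discussed below.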

I also recommend running the evaluation against the full dataset on a schedule, not just the subset used for PR checks. This catches drift in external dependencies — API changes, model updates, knowledge base modifications — that might not surface in the smaller PR dataset.

When evaluation fails, the pipeline should generate a failure report identifying which metrics missed their thresholds and which specific samples failed. This gives teams the signals they need to diagnose and resolve the failures.

Governance & Compliance

For enterprise deployments, evaluation is about more than engineering quality; it is also about organizational accountability. Governance teams need evidence that AI systems meet defined standards. Compliance teams need audit trails. Risk teams need visibility into failure modes.

Offline evaluation provides this evidence. Every run creates a record: which model version was evaluated, which dataset was used, what scores were achieved, whether thresholds were met. These records accumulate into an audit trail demonstrating systematic quality assurance over time.

I recommend defining acceptance criteria collaboratively with governance stakeholders before the first evaluation run. What factual accuracy threshold is acceptable for your use case? What faithfulness level is required? Getting alignment upfront prevents confusion and conflict on interpreting results.

Evaluation metrics acceptable threshold definition: image by author

The criteria should reflect actual risk. A system providing medical information needs higher accuracy thresholds than one summarizing meeting notes. A system making financial recommendations needs higher faithfulness thresholds than one drafting marketing copy. One size doesn’t fit all, and governance teams understand this when you frame it in terms of risk.

Finally, think about reporting for different audiences. Engineering wants detailed breakdowns by metric and query type. Governance wants summary pass/fail status with trend lines. Executives want a dashboard showing green/yellow/red status across systems. Langfuse and similar tools support these different views, but you need to configure them intentionally.

Conclusion

The gap between impressive demos and production-ready systems is bridged through rigorous, systematic evaluation. The framework presented here provides the structure to build governance tailored to your specific agents, use cases, and risk tolerance.

Key Takeaways

  • Evaluation Requirements — Requirements vary depending on the application use case. A simple lookup needs factual accuracy checks. A complex analysis needs reasoning evaluation. A RAG-enabled response needs faithfulness verification. Applying the right evaluations to the right queries gives you signal without noise.
  • Automation — Manual evaluation doesn’t scale and doesn’t catch regressions. Integrating evaluation into CI/CD pipelines, with explicit thresholds that block deployment, turns quality assurance from an ad hoc action into a repeatable practice.
  • Governance — Evaluation records provide the audit trail that compliance needs and the evidence that leadership needs to approve production deployment. Building this connection early makes AI governance a partnership rather than an obstacle.

Where to Start

If you’re not doing systematic offline evaluation today, don’t try to implement everything at once.

  1. Start with routing accuracy and factual accuracy — these are the highest-signal metrics and the easiest to implement. Build a small evaluation dataset, maybe 50–100 samples. Run it manually a few times to calibrate your expectations.
  2. Add reasoning evaluation for complex queries and RAG metrics for retrieval-enabled agents. 
  3. Integrate into CI/CD. Define thresholds with your governance partners. Build, Test, Iterate.

The goal is to start laying the foundation and building processes to provide evidence of quality across defined criteria. That’s the foundation for production readiness, stakeholder confidence, and responsible AI deployment.

This article turned out to be a lengthy one; thank you so much for sticking with it till the end. I hope you found it useful and will try out these concepts. All the best and happy building 🙂
