Why Your AI Search Evaluation Is Probably Wrong (And How to Fix It)



I’ve worked on search quality evaluation for nearly a decade, and I’m often asked, “How do we know if our current AI setup is optimized?” The honest answer? Lots of testing. Clear benchmarks allow you to measure improvements, compare vendors, and justify ROI.

Most teams evaluate AI search by running a handful of queries and picking whichever system “feels” best. Then they spend six months integrating it, only to discover that accuracy is actually worse than that of their previous setup. Here’s how to avoid that $500K mistake.

The problem: ad-hoc testing doesn’t reflect production behavior, isn’t replicable, and off-the-shelf benchmarks aren’t customized to your use case. Effective benchmarks are tailored to your domain, cover different query types, produce consistent results, and account for disagreement among evaluators. After years of research on search quality evaluation, here’s the process that actually works in production.

A Baseline Evaluation Standard

Step 1: Define what “good” means for your use case

Before you even run a single test query, get specific about what a “right” answer looks like. Common traits include baseline accuracy, the freshness of results, and the relevance of sources.

For a financial services client, this may be: “Numerical data must be accurate to within 0.1% of official sources, cited with publication timestamps.” For a developer tools company: “Code examples must execute without modification in the specified language version.”

From there, document your threshold for switching providers. Instead of an arbitrary “5-15% improvement,” tie it to business impact: If a 1% accuracy improvement saves your support team 40 hours/month, and switching costs $10K in engineering time, you break even at 2.5% improvement in month one.
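That break-even arithmetic is worth encoding so it can be audited; a minimal sketch, assuming a fully loaded support cost of $100/hour (the rate implied by the figures above):

```python
HOURLY_RATE = 100.0        # assumed fully loaded cost of one support hour
HOURS_SAVED_PER_PP = 40    # hours/month saved per 1pp accuracy gain (from the text)
SWITCHING_COST = 10_000    # one-time engineering cost in dollars (from the text)

def breakeven_improvement_pp(months: int = 1) -> float:
    """Accuracy gain (percentage points) needed to recoup the switching
    cost within `months` months of saved support time."""
    monthly_value_per_pp = HOURS_SAVED_PER_PP * HOURLY_RATE
    return SWITCHING_COST / (monthly_value_per_pp * months)

print(breakeven_improvement_pp())  # 2.5
```

Change the hourly rate or payback window and the threshold moves with it, which is exactly the conversation you want to have with finance before a migration.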

Step 2: Build your golden test set

A golden set is a curated collection of queries and answers that gets your organization on the same page about quality. Begin sourcing these queries from your production query logs. I recommend dedicating 80% of the golden set to common query patterns and the remaining 20% to edge cases. For sample size, aim for 100-200 queries minimum; this produces confidence intervals of ±2-3%, tight enough to detect meaningful differences between providers.
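The ±2-3% figure assumes you aggregate over repeated trials per query (see Step 3); a quick Wald-interval sketch shows why, assuming roughly 80% true accuracy:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a 95% Wald confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical setup: 150 golden-set queries x 8 trials each (Step 3),
# assuming ~80% true accuracy. Trials of the same query are correlated,
# so treat this as an optimistic lower bound on the interval width.
observations = 150 * 8
print(round(ci_half_width(0.80, observations), 3))  # 0.023, i.e. about ±2.3%
```

With single-run scoring of 150 queries, the same formula gives roughly ±6%, which is why repeated trials matter.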

From there, develop a grading rubric for scoring each result’s relevance. For factual queries, I define: “Score 4 if the result contains the exact answer with an authoritative citation. Score 3 if correct, but requires user inference. Score 2 if partially relevant. Score 1 if tangentially related. Score 0 if unrelated.” Include 5-10 example queries with scored results for each category.

Once you’ve established that list, have two domain experts independently label each query’s top-10 results and measure their agreement with Cohen’s Kappa. A kappa below 0.60 usually points to unclear criteria, inadequate rater training, or genuine differences in judgment, and each needs to be addressed before you trust the scores. When you revise the rubric, record every change in a versioned changelog and keep each version, so earlier test runs remain reproducible.
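Cohen’s Kappa is simple enough to compute without a stats library; a sketch on hypothetical expert labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label at random
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical 0-4 relevance scores from two experts on ten results
expert_1 = [4, 3, 3, 2, 4, 1, 0, 3, 2, 4]
expert_2 = [4, 3, 2, 2, 4, 1, 0, 3, 3, 4]
print(round(cohens_kappa(expert_1, expert_2), 2))  # 0.74
```

A kappa of 0.74 clears the 0.60 bar; below it, tighten the rubric before scaling up labeling.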

Step 3: Run controlled comparisons

Now that you have your list of test queries and a clear rubric to measure accuracy, run your query set across all providers in parallel and collect the top-10 results, including position, title, snippet, URL, and timestamp. You should also log query latency, HTTP status codes, API versions, and result counts.

For RAG pipelines or agentic search testing, pass each provider’s results through the same LLM with an identical synthesis prompt and temperature set to 0, since you’re isolating search quality, not generation variance.

Most evaluations fail because they only run each query once. Search systems are inherently stochastic, so sampling randomness, API variability, and timeout behavior all introduce trial-to-trial variance. To measure this properly, run multiple trials per query (I recommend starting with n=8-16 trials for structured retrieval tasks, n≥32 for complex reasoning tasks).
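Putting the parallel query runs and repeated trials together, a harness might look like the sketch below. `fake_search` is a stand-in stub and the field names are illustrative; swap in your real provider clients:

```python
import concurrent.futures as cf
import time

def run_query(provider: str, query: str, search_fn, n_trials: int = 8):
    """Run one query n_trials times against a provider and log each trial.
    `search_fn(provider, query)` stands in for your real provider client."""
    records = []
    for trial in range(n_trials):
        start = time.perf_counter()
        results = search_fn(provider, query)  # expected: top-10 result dicts
        latency_ms = (time.perf_counter() - start) * 1000
        records.append({
            "provider": provider,
            "query": query,
            "trial": trial,
            "latency_ms": latency_ms,
            "n_results": len(results),
        })
    return records

def fake_search(provider, query):
    """Stub so the sketch runs end to end; replace with real API calls."""
    return [{"rank": i, "url": f"https://example.com/{i}"} for i in range(10)]

# Run both providers in parallel, 8 trials per query
with cf.ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_query, p, "example query", fake_search)
               for p in ("provider_a", "provider_b")]
    logs = [record for f in futures for record in f.result()]

print(len(logs))  # 16 records: 2 providers x 8 trials
```

In production you would also persist HTTP status codes, API versions, and timestamps alongside each record, as described above.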

Step 4: Evaluate with LLM Judges

Modern LLMs have significantly more reasoning capacity than search systems. Search engines use small re-rankers optimized for millisecond latency, while LLMs bring 100B+ parameters and seconds of reasoning time to each judgment. This capacity asymmetry means LLMs can judge the quality of results more thoroughly than the systems that produced them.

However, this analysis only works if you equip the LLM with a detailed scoring prompt that uses the same rubric as human evaluators. Provide example queries with scored results as a demonstration, and require a structured JSON output with a relevance score (0-4) and a brief explanation per result.
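The structured-output contract is easy to enforce before any scores enter your analysis; a minimal validator, assuming the two fields described above (names are illustrative):

```python
import json

def parse_judgment(raw: str) -> dict:
    """Validate an LLM judge's structured output against the 0-4 rubric."""
    judgment = json.loads(raw)
    score = judgment.get("relevance_score")
    if not isinstance(score, int) or not 0 <= score <= 4:
        raise ValueError(f"score outside 0-4 rubric: {score!r}")
    if not judgment.get("explanation"):
        raise ValueError("missing explanation")
    return judgment

# Hypothetical judge response for one result
raw = '{"relevance_score": 3, "explanation": "Correct, but requires user inference."}'
print(parse_judgment(raw)["relevance_score"])  # 3
```

Rejecting malformed or out-of-range judgments up front keeps silent scoring drift out of your aggregates.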

To validate the judge, have it and two human experts independently score a 100-query validation subset covering easy, medium, and hard queries. Once that’s done, calculate inter-human agreement using Cohen’s Kappa (target: κ > 0.70) and LLM-human Pearson correlation (target: r > 0.80). I’ve seen Claude Sonnet achieve 0.84 agreement with expert raters when the rubric is well-specified.
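The Pearson check needs only a few lines of standard-library Python; the scores below are illustrative, not real data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 0-4 scores: human consensus vs. LLM judge on eight results
human = [4, 3, 2, 4, 1, 0, 3, 2]
judge = [4, 3, 2, 3, 1, 0, 3, 2]
print(round(pearson_r(human, judge), 2))  # 0.97
```

A correlation above 0.80 on the validation subset gives you license to let the judge score the rest of the set unattended.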

Step 5: Measure evaluation stability with ICC

Accuracy alone doesn’t tell you if your evaluation is trustworthy. You also need to know if the variance you’re seeing among search results reflects genuine differences in query difficulty, or just random noise from inconsistent model provider behavior.

The Intraclass Correlation Coefficient (ICC) splits variance into two buckets: between-query variance (some queries are just harder than others) and within-query variance (inconsistent results for the same query across runs).

Here’s how to interpret ICC when vetting AI search providers: 

  • ICC ≥ 0.75: Good reliability. Provider responses are consistent.
  • ICC = 0.50-0.75: Moderate reliability. Mixed contribution from query difficulty and provider inconsistency.
  • ICC < 0.50: Poor reliability. Single-run results are unreliable.
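ICC(1), the one-way random-effects form, falls out of a simple ANOVA decomposition; a self-contained sketch on hypothetical 0/1 accuracy outcomes:

```python
def icc_1(scores):
    """One-way random-effects ICC(1). `scores` holds one list of trial
    scores per query; equal trial counts are assumed for simplicity."""
    n, k = len(scores), len(scores[0])          # queries, trials per query
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((x - m) ** 2
                    for row, m in zip(scores, row_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical 0/1 accuracy outcomes: 4 queries x 4 trials for one provider
trials = [
    [1, 1, 1, 1],  # easy query, always correct
    [0, 0, 0, 0],  # hard query, always wrong
    [1, 1, 0, 1],  # mostly consistent
    [0, 1, 0, 0],  # mostly consistent
]
print(round(icc_1(trials), 2))  # 0.59
```

Here most of the variance comes from query difficulty rather than run-to-run noise, landing in the moderate-reliability band above.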

Consider two providers, both achieving 73% accuracy:

  • 73% accuracy, ICC = 0.66: consistent behavior across trials.
  • 73% accuracy, ICC = 0.30: unpredictable; the same query produces different results.

Without ICC, you’d deploy the second provider, thinking you’re getting 73% accuracy, only to discover reliability problems in production.

In our research evaluating providers on GAIA (reasoning tasks) and FRAMES (retrieval tasks), we found ICC varies dramatically with task complexity, from 0.30 for complex reasoning with less capable models to 0.71 for structured retrieval. Often, accuracy improvements without ICC improvements reflected lucky sampling rather than genuine capability gains.

What Success Actually Looks Like

With that validation in place, you can evaluate providers across your full test set. Results might look like:

  • Provider A: 81.2% ± 2.1% accuracy (95% CI: 79.1-83.3%), ICC=0.68
  • Provider B: 78.9% ± 2.8% accuracy (95% CI: 76.1-81.7%), ICC=0.71

Provider A’s interval sits higher, but the two intervals overlap (79.1% is below 81.7%), so the 2.3pp advantage is not statistically significant at p<0.05 on this sample; you’d need more queries or more trials to confirm it. Provider B’s higher ICC also means it’s more consistent: same query, more predictable results. Depending on your use case, consistency may matter more than an unconfirmed 2.3pp accuracy difference.

  • Provider C: 83.1% ± 4.8% accuracy (95% CI: 78.3-87.9%), ICC=0.42
  • Provider D: 79.8% ± 4.2% accuracy (95% CI: 75.6-84.0%), ICC=0.39

Provider C appears better, but those wide confidence intervals overlap substantially. More critically, both providers have ICC < 0.50, indicating that most variance is due to trial-to-trial randomness rather than query difficulty. When you see variance like this, your evaluation methodology itself needs debugging before you can trust the comparison.
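Eyeballing interval overlap is a crude check; to compare two providers’ headline numbers directly, you can back out standard errors from the 95% half-widths and z-test the difference. A sketch under a normal approximation, with hypothetical figures:

```python
import math

def significant_difference(p1, hw1, p2, hw2, z_crit=1.96):
    """Recover standard errors from 95% CI half-widths (normal
    approximation) and z-test the gap between two accuracy figures."""
    se_diff = math.sqrt((hw1 / z_crit) ** 2 + (hw2 / z_crit) ** 2)
    z = (p1 - p2) / se_diff
    return z, abs(z) > z_crit

# Hypothetical figures: 82.0% +/- 2.0pp vs. 77.0% +/- 2.5pp
z, sig = significant_difference(82.0, 2.0, 77.0, 2.5)
print(round(z, 2), sig)  # 3.06 True
```

Non-overlapping 95% CIs do guarantee significance, but overlapping ones don’t rule it out, so always test the difference directly rather than relying on the intervals alone.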

This isn’t the only way to evaluate search quality, but I find it one of the most effective for balancing accuracy with feasibility. This framework delivers reproducible results that predict production performance, enabling you to compare providers on equal footing.

Right now, too many teams rely on cherry-picked demos, and most vendor comparisons are meaningless because everyone measures differently. If you’re making million-dollar decisions about search infrastructure, you owe it to your team to measure properly.
