The first time I heard about the idea of using AI to evaluate AI, also known as “LLM-as-a-Judge,” my reaction was:
“Ok, we have officially lost our minds.”
We live in a world where even toilet paper is marketed as “AI-powered.” I assumed this was just another hype-driven trend in our chaotic and fast-moving AI landscape.
But once I looked into what LLM-as-a-Judge actually means, I realized I was wrong. Let me explain.
There is one picture that every Data Scientist and Machine Learning Engineer should keep in the back of their mind, and it captures the entire spectrum of model complexity, training set size, and expected performance level:
If the task is simple, having a small training set is usually not a problem. In some extreme cases, you can even solve it with a simple rule-based approach. Even when the task becomes more complex, you can often reach high performance as long as you have a large and diverse training set.
The real trouble begins when the task is complex and you do not have access to a comprehensive training set. At that point, there is no clean recipe. You need domain experts, manual data collection, and careful evaluation procedures, and in the worst situations, you might face months or even years of work just to build reliable labels.
… this was before Large Language Models (LLMs).
The LLM-as-a-Judge paradigm
The promise of LLMs is simple: you get something close to “PhD-level” expertise in many fields, all reachable through a single API call. We can (and probably should) argue about how “intelligent” these systems really are. There is growing evidence that an LLM behaves more like an extremely powerful pattern matcher and information retriever than a truly intelligent agent [you should absolutely watch this].
However, one thing is hard to deny. When the task is complex, difficult to formalize, and you do not have a ready-made dataset, LLMs can be incredibly useful. In these situations, they give you high-level reasoning and domain knowledge on demand, long before you could ever collect and label enough data to train a traditional model.
So let’s go back to our “big trouble” red square. Imagine you have a difficult problem and only a very rough first version of a model. Maybe it was trained on a tiny dataset, or maybe it is a pre-existing model that you have not fine-tuned at all (e.g. BERT or whatever other embedding model).
In situations like this, you can use an LLM to evaluate how this V0 model is performing. The LLM becomes the evaluator (or the judge) for your early prototype, giving you immediate feedback without requiring a large labeled dataset or the huge effort we mentioned earlier.

This would have many beneficial downstream applications:
- Evaluating the state of the V0 and its performance
- Building a training set to improve the existing model
- Monitoring the state of the existing model or its fine-tuned version over time (following point 2).
So let’s build this!
LLM-as-a-Judge in Production
Now, there is a false syllogism floating around: since you don’t have to train an LLM and they are intuitive to use through the ChatGPT/Anthropic/Gemini UIs, building an LLM system must be easy. That is not the case.
If your goal is more than a simple plug-and-play feature, you need active effort to make sure your LLM is reliable, precise, and as hallucination-free as possible, and to design it to fail gracefully when it fails (not if, but when).
Here are the main topics we will cover to build a production-ready LLM-as-a-Judge system.
- System design
We will define the role of the LLM, how it should behave, and what perspective or “persona” it should use during evaluation.
- Few-shot examples
We will give the LLM concrete examples that show exactly how the evaluation should look for different test cases.
- Triggering Chain-of-Thought
We will ask the LLM to produce notes, intermediate reasoning, and a confidence level in order to trigger a more reliable form of Chain-of-Thought. This encourages the model to actually “think.”
- Batch evaluation
To reduce cost and latency, we will send multiple inputs at once and reuse the same prompt across a batch of examples.
- Output formatting
We will use Pydantic to enforce a structured output schema and provide that schema directly to the LLM, which makes integration cleaner and production-safe.
Let’s dive into the code! 🚀
Code
The full code can be found on the following GitHub page [here]. I’m going to walk through its main parts in the following paragraphs.
1. Setup
Let’s start with some housekeeping.
The dirty work is done using OpenAI and wrapped in llm_judge. For this reason, everything you need to import is in the following block:
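The exact imports live in the linked repo; as a minimal sketch (assuming the standard openai Python SDK, and treating the llm_judge wrapper and its interface as assumptions for illustration), the setup could look like this:

```python
# Minimal setup sketch. The llm_judge wrapper comes from the linked repo;
# its exact interface is assumed here, not reproduced.
import os

from openai import OpenAI        # official OpenAI Python SDK
# from llm_judge import LLMJudge  # wrapper from the repo (interface assumed)

# The API key is read from the environment, never hard-coded.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
```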
Note: You will need the OpenAI API key.
All the production-level code is handled on the backend (thank me later). Let’s carry on.
2. Our Use Case
Let’s say we have a sentiment classification model that we want to evaluate. The model takes customer reviews and predicts: Positive, Negative, or Neutral.
Here’s sample data our model classified:
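The exact rows are in the repo; here is an illustrative stand-in with the same structure, where indices 2 and 3 are deliberately misclassified:

```python
# Illustrative sample data (not the exact rows from the repo).
# ground_truth is included only to show where the weak V0 model
# goes wrong (index 2 and index 3 are misclassified).
sample_data = [
    {"review": "Absolutely love this product, works perfectly!",
     "model_prediction": "Positive", "ground_truth": "Positive"},
    {"review": "Terrible quality, it broke after two days.",
     "model_prediction": "Negative", "ground_truth": "Negative"},
    {"review": "Honestly one of the worst purchases I have ever made.",
     "model_prediction": "Positive", "ground_truth": "Negative"},   # error
    {"review": "It arrived on time and does what it says, nothing more.",
     "model_prediction": "Negative", "ground_truth": "Neutral"},    # error
]
```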
For each prediction, we want to know:
– Is this output correct?
– How confident are we in that judgment?
– Why is it correct or incorrect?
– How would we score the quality?
This is where LLM-as-a-Judge comes in. Notice that ground_truth is not actually in our real-world dataset; that is why we are using an LLM in the first place. 🙃
The only reason you see it here is to display the classifications where our original model is underperforming (index 2 and index 3).
Note that in this case, we are pretending to have a weaker model in place that makes some errors. In a real-world scenario, this happens when you use a small model or adapt a non-fine-tuned deep learning model.
3. Role Definition
Just like with any prompt engineering, we need to clearly define:
1. Who is the judge? The LLM will act as the judge, so we need to define its expertise and background.
2. What are they evaluating? The specific task we want the LLM to evaluate.
3. What criteria should they use? What the LLM must check to determine whether an output is good or bad.
This is how we are defining this:
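The exact prompt is in the repo; a sketch of how such a role definition could look (wording assumed, not the verbatim prompt) is:

```python
# Sketch of the judge's role definition (the repo's actual prompt may differ).
# It covers the three points above: who the judge is, what it evaluates,
# and which criteria it should apply.
JUDGE_ROLE = """You are an expert evaluator of sentiment classification systems,
with extensive experience in NLP and customer-feedback analysis.

You will be given a customer review and the sentiment predicted by a model
(Positive, Negative, or Neutral). Your task is to judge whether the prediction
is correct.

Evaluation procedure:
1. Read the review carefully and decide what the correct sentiment is.
2. Compare your judgment with the model's prediction.
3. Return a quality score from 0 to 100, a verdict (correct/incorrect),
   a confidence level, and a short reasoning that cites specific words
   or phrases from the review."""
```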
Some recipe notes: Use clear instructions. Tell the LLM what you want it to do (not what you want it to avoid). Be very specific about the evaluation procedure.
4. ReAct Paradigm
The ReAct pattern (Reasoning + Acting) is built into our framework. Each judgment includes:
1. Score (0-100): Quantitative quality assessment
2. Verdict: Binary or categorical judgment
3. Confidence: How certain the judge is
4. Reasoning: Chain-of-thought explanation
5. Notes: Additional observations
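To make this concrete, a single judgment could look like the following (an illustrative structure; the repo's exact fields and types may differ):

```python
# Illustrative example of a single judgment (field names follow the list
# above; the repo's exact schema may differ).
example_judgment = {
    "score": 25,
    "verdict": "incorrect",
    "confidence": 0.92,
    "reasoning": "The review says 'worst purchases I have ever made', which is "
                 "clearly negative, but the model predicted Positive.",
    "notes": "Strong negative lexical cues; no sarcasm detected.",
}
```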
This enables:
– Transparency: You can see why the judge made each decision
– Debugging: Identify patterns in errors
– Human-in-the-loop: Route low-confidence judgments to humans
– Quality control: Track judge performance over time
5. Few-shot examples
Now, let’s provide some more examples to make sure the LLM has some context on how to evaluate real-world cases:
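The block below is an illustrative sketch rather than the repo's exact few-shot examples; it covers a correct case, a clear error, and a debatable case, with calibrated scores as discussed right after:

```python
# Illustrative few-shot examples appended to the prompt (not the exact
# ones from the repo). One correct case, one clear error, one debatable case.
FEW_SHOT_EXAMPLES = """
Example 1
Review: "Fantastic service, will definitely buy again!"
Model prediction: Positive
Judgment: score=100, verdict=correct, confidence=0.98
Reasoning: "Fantastic" and "will definitely buy again" are unambiguously positive.

Example 2
Review: "This is the worst thing I have ever bought."
Model prediction: Positive
Judgment: score=25, verdict=incorrect, confidence=0.95
Reasoning: "worst thing I have ever bought" is clearly negative; the prediction is wrong.

Example 3
Review: "It's fine, I guess. Does the job."
Model prediction: Positive
Judgment: score=60, verdict=debatable, confidence=0.6
Reasoning: The tone is lukewarm; Neutral fits better, but Positive is defensible.
"""
```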
We will include these examples in the prompt so the LLM learns how to perform the task from the examples we give.
Some recipe notes: Cover different scenarios: correct, incorrect, and partially correct. Show score calibration (100 for perfect, 20-30 for clear errors, 60 for debatable cases). Explain the reasoning in detail. Reference specific words/phrases from the input.
6. LLM Judge Definition
The whole thing is packaged in the following block of code:
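The repo packs this into roughly ten lines; the class below is a hypothetical reconstruction (names and signature assumed) of what that wrapper does, reusing the sketched JUDGE_ROLE and client from above:

```python
# Hypothetical reconstruction of the wrapper (the repo's version may differ):
# it glues the role, the few-shot examples, and the OpenAI call together.
class LLMJudge:
    def __init__(self, client, role, examples, model="gpt-4o-mini"):
        self.client, self.model = client, model
        self.system_prompt = role + "\n\n" + examples

    def judge(self, items):
        # Batch evaluation: all items go into a single prompt / API call.
        user_prompt = "Evaluate the following predictions:\n" + "\n".join(
            f"{i}. Review: {it['review']} | Prediction: {it['model_prediction']}"
            for i, it in enumerate(items)
        )
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": self.system_prompt},
                      {"role": "user", "content": user_prompt}],
        )
        return response.choices[0].message.content
```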
Just like that. 10 lines of code. Let’s use this:
7. Let’s run!
This is how to run the whole LLM Judge API call:
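Using the names from the sketches above (again, an assumption about the repo's exact interface), the call boils down to:

```python
# Run the judge on the sample data (names reused from the sketches above).
judge = LLMJudge(client, JUDGE_ROLE, FEW_SHOT_EXAMPLES)
raw_output = judge.judge(sample_data)
print(raw_output)
```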
So we can immediately see that the LLM Judge is correctly judging the performance of the “model” in place. In particular, it is identifying that the last two model outputs are incorrect, which is what we expected.
While this is good to show that everything is working, in a production environment we can’t just “print” the output to the console: we need to store it and make sure the format is standardized. This is how we do it:
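A sketch of that step, assuming Pydantic v2 and the names introduced above (the repo may implement this differently): we describe the schema in the prompt, request JSON, and validate the response into typed objects.

```python
# Sketch of structured output with Pydantic (the repo may differ):
# describe the schema in the prompt, request JSON, validate the response.
import json

from pydantic import BaseModel


class Judgment(BaseModel):
    index: int
    score: int          # 0-100 quality score
    verdict: str        # e.g. "correct" / "incorrect"
    confidence: float   # 0-1
    reasoning: str
    notes: str = ""


class JudgmentBatch(BaseModel):
    judgments: list[Judgment]


schema_prompt = ("Return ONLY valid JSON matching this schema:\n"
                 + json.dumps(JudgmentBatch.model_json_schema()))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},   # force a JSON response
    messages=[
        {"role": "system",
         "content": JUDGE_ROLE + "\n\n" + FEW_SHOT_EXAMPLES + "\n\n" + schema_prompt},
        {"role": "user",
         "content": "Evaluate the following predictions:\n" + "\n".join(
             f"{i}. Review: {it['review']} | Prediction: {it['model_prediction']}"
             for i, it in enumerate(sample_data))},
    ],
)

# Validate into typed objects that can be stored or logged downstream.
results = JudgmentBatch.model_validate_json(response.choices[0].message.content)
for j in results.judgments:
    print(j.index, j.verdict, j.score, j.confidence)
```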
And this is how it looks.
Note that we are also “batching”, meaning we are sending multiple pieces of input at once. This saves cost and time.
8. Bonus
Now, here is the kicker. Say you have a completely different task to evaluate, for example, the chatbot responses of your model. The whole code can be refactored with just a few lines:
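As a sketch (prompts and names assumed, reusing the LLMJudge reconstruction from above), swapping tasks only means swapping prompts:

```python
# Sketch: adapting the same judge to a new task only requires new prompts.
# (In practice you would also swap the per-item formatting template.)
CHATBOT_JUDGE_ROLE = """You are an expert evaluator of customer-support chatbots.
You will be given a user message and the chatbot's response.
Judge whether the response is helpful, accurate, and appropriately toned,
and return a score (0-100), a verdict, a confidence level, reasoning, and notes."""

CHATBOT_FEW_SHOT = """Example
User: "My order never arrived."
Chatbot: "Sorry to hear that! I've opened a ticket and you'll get an update within 24 hours."
Judgment: score=95, verdict=good, confidence=0.9
Reasoning: Empathetic, actionable, and specific about next steps."""

chatbot_judge = LLMJudge(client, CHATBOT_JUDGE_ROLE, CHATBOT_FEW_SHOT)
```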
Since two different “judges” differ only in the prompts we provide to the LLM, adapting the system from one evaluation task to another is extremely straightforward.
Conclusions
LLM-as-a-Judge is a simple idea with a lot of practical power. When your model is rough, your task is complex, and you do not have a labeled dataset, an LLM can help you evaluate outputs, understand mistakes, and iterate faster.
Here is what we built:
- A clear role and persona for the judge
- Few-shot examples to guide its behavior
- Chain-of-Thought reasoning for transparency
- Batch evaluation to save time and cost
- Structured output with Pydantic for production use
The result is a flexible evaluation engine that can be reused across tasks with only minor changes. It is not a replacement for human evaluation, but it provides a strong starting point long before you can collect the necessary data.
Before you head out
Thank you again for your time. It means a lot ❤️
My name is Piero Paialunga, and I’m this guy here:

I’m originally from Italy, hold a Ph.D. from the University of Cincinnati, and work as a Data Scientist at The Trade Desk in New York City. I write about AI, Machine Learning, and the evolving role of data scientists both here on TDS and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:
A. Follow me on Linkedin, where I publish all my stories
B. Follow me on GitHub, where you can see all my code
C. For questions, you can send me an email at piero.paialunga@hotmail