Retrieval, a cornerstone of Generative AI systems, is still challenging. Retrieval Augmented Generation, or RAG for short, is an approach to building AI-powered chatbots that answer questions grounded in your own data — data the underlying AI model, an LLM, was not necessarily trained on.
Evaluation data from sources like WikiEval shows that out-of-the-box natural language retrieval accuracy is often low. This means you will probably need to run experiments to tune RAG parameters for your GenAI system before deploying it. However, before you can experiment with RAG, you need a way to evaluate which experiments produced the best results!
Using Large Language Models (LLMs) as judges has gained prominence in modern RAG evaluation. This approach involves using powerful language models, like OpenAI’s GPT-4, to assess the quality of components in RAG systems. LLMs serve as judges by evaluating the relevance, precision, adherence to instructions, and overall quality of the responses produced by the RAG system.
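To make the LLM-as-judge idea concrete, here is a minimal sketch of scoring a RAG answer's faithfulness to its retrieved context. The prompt wording, the 1–5 scale, and the `call_llm` function are all illustrative assumptions — in practice `call_llm` would wrap a real client such as OpenAI's chat API, and frameworks like Ragas package prompts like this for you.

```python
import re

# Illustrative judge prompt; real evaluation frameworks use more carefully
# calibrated prompts and rubrics.
JUDGE_PROMPT = """You are an impartial judge. Given a question, retrieved context,
and an answer, rate how faithful the answer is to the context on a scale of 1-5.
Reply with only the number.

Question: {question}
Context: {context}
Answer: {answer}
Rating:"""


def call_llm(prompt: str) -> str:
    # Stub standing in for a real LLM call (e.g. GPT-4 via an API client).
    # It always returns a fixed rating so the sketch runs offline.
    return "4"


def judge_faithfulness(question: str, context: str, answer: str) -> int:
    """Ask the judge LLM for a 1-5 faithfulness rating and parse it."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    reply = call_llm(prompt)
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge returned no parseable rating: {reply!r}")
    return int(match.group())


score = judge_faithfulness(
    question="When was the Eiffel Tower completed?",
    context="The Eiffel Tower was completed in 1889.",
    answer="It was completed in 1889.",
)
print(score)
```

The same pattern extends to other criteria — answer relevance, context precision, instruction adherence — by swapping in a different rubric prompt, which is why a single judge LLM can evaluate multiple components of a RAG pipeline.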