Generative AI
Having gone mainstream over a year ago with the releases of Stable Diffusion and ChatGPT, generative AI is developing incredibly fast. New models claiming to beat the state of the art are announced almost every week. But how do we know whether they are actually any good? How do we compare and rank generative models in the absence of ground truth, the “correct” solutions? Finally, if an LLM uses external data through a Retrieval-Augmented Generation (RAG) system, how do we judge whether it makes correct use of those data?
In a two-part series, we will explore evaluation protocols for generative AI. This post focuses on text generation and Large Language Models. Keep an eye out for a follow-up in which we will discuss evaluation methods for image generators.
Let’s start by noting the distinction between generative and discriminative models. Generative models produce new data samples that resemble their training data, be it text, images, audio, video, latent representations, or even tabular data. Discriminative models, on the other hand, learn decision boundaries from the training data, allowing us to solve classification, regression, and other tasks.
GenAI evaluation challenges
Evaluating generative models is inherently more challenging than evaluating discriminative models due to the nature of their tasks. A discriminative model’s performance is relatively straightforward to measure using task-appropriate metrics: precision for classification tasks, mean squared error for regression tasks, or intersection over union for object detection tasks.
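To make that contrast concrete, here is a minimal sketch, with made-up toy data, of how directly such discriminative metrics can be computed: precision and mean squared error via scikit-learn, and a small hand-rolled intersection-over-union function for bounding boxes. The specific numbers and the iou helper are illustrative assumptions, not part of any particular benchmark.

```python
# Toy illustration of discriminative metrics: a single ground truth exists,
# so each prediction can be scored directly against it.
from sklearn.metrics import precision_score, mean_squared_error

# Classification: precision compares predicted labels against ground-truth labels.
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 1]
print("precision:", precision_score(y_true_cls, y_pred_cls))  # 2 TP / (2 TP + 1 FP) ≈ 0.67

# Regression: mean squared error penalizes distance from the true values.
y_true_reg = [2.5, 0.0, 2.1]
y_pred_reg = [3.0, -0.5, 2.0]
print("mse:", mean_squared_error(y_true_reg, y_pred_reg))

# Object detection: intersection over union of two boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print("iou:", iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```

In each case the metric reduces to comparing a prediction with a known correct answer; generative outputs offer no such single reference.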
In contrast, generative models aim to produce new, previously unseen content. Assessing the quality, coherence, diversity, and usefulness of these generated samples is more complex.