From Possible to Probable AI Models

Contents

1. Dimensionality and the Space of Possibilities
2. Frequentist measurements vs Bayesian expectations
3. Confidence is not the same as probability
4. The Law of Large Numbers and why more data does not automatically mean more truth
5. Stochasticity is not necessarily creativity
6. Moving from possible to reliable
Final Thoughts

Over the years, I have been involved in many conversations about generative AI (and you probably have, too!). These conversations varied in focus, from ones with the general public about the use of AI to ones with more technical people about the accuracy of models. Regardless of who I am talking to, people are often fascinated by and curious about what models can do.

Can an LLM write a functional kernel driver? It can. Can it write a song about how much you love your cat? It sure can. Can a diffusion model generate a photo-realistic image of a medieval astronaut? It can.

But does “can” mean it will be good? It turns out that what is “possible” for most models is a surprisingly low bar.

If you have studied probability or statistics, you probably know that in a sufficiently large sample space, almost anything becomes possible. The challenge is not determining whether an outcome can happen; it is understanding how likely that outcome is and whether we can depend on it repeatedly.

That distinction between what is possible and what is probable is something many people miss when they think about generative AI. It matters because building a production AI system is very different from building a demo. Demos thrive on interesting edge cases. Production systems depend on consistency.

As AI systems become an increasingly important part of workflows and decision-making, it is worth revisiting fundamental ideas from probability theory and examining where common assumptions about AI reliability begin to break down.

1. Dimensionality and the Space of Possibilities

To be fair, talking about reliable systems is much easier than building them. To understand why reliability remains so difficult, it helps to take a step back and think about sample spaces. Let’s start with the simplest of cases, a coin flip. For a coin toss, the sample space is $\Omega = \{H, T\}$. The possible outcomes are easy to visualize because the space of possibilities is small.

Now consider a language model generating a sequence of 512 tokens with a vocabulary of 50,000 possible tokens, which gives a sample space of size $50000^{512}$. The size of this sample space is almost impossible to comprehend, let alone visualize.
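To make that number slightly more tangible, here is a minimal sketch that computes its order of magnitude. The vocabulary size and sequence length come from the example above; the function name `log10_sample_space` is just an illustrative choice.

```python
import math

VOCAB_SIZE = 50_000  # vocabulary size from the example above
SEQ_LEN = 512        # sequence length from the example above

def log10_sample_space(vocab_size: int, seq_len: int) -> float:
    """Return log10 of vocab_size ** seq_len without building the huge integer."""
    return seq_len * math.log10(vocab_size)

exponent = log10_sample_space(VOCAB_SIZE, SEQ_LEN)
num_digits = math.floor(exponent) + 1  # number of decimal digits in the count
print(f"50000^512 is roughly 10^{exponent:.1f}, a number with {num_digits} digits")
```

In other words, the number of possible 512-token sequences has more than two thousand digits, while a coin flip has two outcomes.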

In such a large space, the region corresponding to useful, coherent, and factually correct outputs can be surprisingly small relative to the number of plausible alternatives. In other words, in the sea of possible outcomes, what is probable is a pond.

When the model returns an answer that is possible but not probable, we call it a hallucination. A hallucination, then, is not necessarily a software bug. Instead, it happens because the model is sampling from regions of the distribution with non-zero probability but little practical value.

At first glance, you may think:

“If we simply collect more data, hallucinations will disappear.”

But the challenge is that hallucinations naturally arise in probabilistic systems. Sampling from a distribution always introduces the possibility of landing in low-probability regions.
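To get a feel for how quickly small per-token risks add up, here is a minimal sketch. It assumes, purely for illustration, that each token is drawn independently and that `p_rare` (a hypothetical parameter) is the chance of landing in a low-probability region at any single step. Real models are not independent across steps, so treat the numbers as rough intuition, not a measurement.

```python
def prob_at_least_one_rare_pick(p_rare: float, steps: int) -> float:
    """Chance that at least one of `steps` independent draws hits a rare token."""
    return 1.0 - (1.0 - p_rare) ** steps

# Assumed per-step probabilities; 512 matches the sequence length used earlier.
for p_rare in (0.001, 0.01, 0.05):
    chance = prob_at_least_one_rare_pick(p_rare, steps=512)
    print(f"p_rare={p_rare:<6} -> P(at least one rare pick in 512 steps) = {chance:.3f}")
```

Under these simplified assumptions, even a 1% chance per token makes at least one rare pick over 512 tokens very likely, which is why more data alone cannot drive the risk to zero.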

Image by the author

2. Frequentist measurements vs Bayesian expectations

When evaluating AI systems, people often take one of two very different approaches. The first is, more or less, a frequentist perspective: you run 1,000 benchmark tasks and measure performance. If the model solves 850 of them correctly, you call it an 85% accurate system.
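A single accuracy figure hides how much it could move on a different draw of tasks. As a sketch, the snippet below computes a Wilson score interval for the 850 out of 1,000 example. The Wilson interval is one standard choice that I am assuming here, not something the benchmark itself prescribes, and it treats the tasks as independent trials.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(850, 1000)
print(f"observed accuracy 0.850, approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

Even in this idealized setting, "85% accurate" is better read as "somewhere in the low-to-high 80s on tasks like these".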

The second is a Bayesian perspective, where you start with expectations about how an intelligent system should behave and update those beliefs when unexpected failures happen.
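One simple way to picture this kind of updating is a Beta-Binomial model. The sketch below is an illustration under assumed numbers: the prior `Beta(19, 1)` and the observed run of 2 successes and 3 failures are hypothetical, chosen only to show how a few unexpected failures pull the expectation down.

```python
def beta_update(alpha: float, beta: float, successes: int, failures: int) -> tuple[float, float]:
    """Conjugate update of a Beta(alpha, beta) prior with binomial observations."""
    return alpha + successes, beta + failures

def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical prior: we expect about 95% accuracy, worth 20 pseudo-observations.
prior = (19.0, 1.0)
# Hypothetical evidence: a short run with more failures than expected.
posterior = beta_update(*prior, successes=2, failures=3)

print(f"prior mean     = {beta_mean(*prior):.3f}")
print(f"posterior mean = {beta_mean(*posterior):.3f}")
```

How far the estimate moves depends on how strong the prior is, which is exactly the kind of judgment a single benchmark score does not capture.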

This difference becomes important because prompts are rarely independent events. Suppose a model answers nine math questions correctly. It is tempting to assume that the probability of getting question ten right is simply the model's reported accuracy, as if each question were an independent trial.

But language models are not a collection of isolated Bernoulli trials. Their outputs depend on previous context, hidden representations, and the density of related examples within the training distribution.

This means their performance is often conditional rather than static: accuracy depends on what is being asked and on what came before it.
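A small sketch makes the gap between a headline number and conditional performance visible. The evaluation log below is entirely hypothetical (the topics and counts are made up for illustration); the point is only that one overall accuracy can hide very different conditional accuracies.

```python
from collections import defaultdict

# Hypothetical evaluation log: (topic, answered_correctly). Not real benchmark data.
results = (
    [("arithmetic", True)] * 45 + [("arithmetic", False)] * 5
    + [("geometry", True)] * 25 + [("geometry", False)] * 25
)

def accuracy_by_group(records):
    """Return overall accuracy and a dict of accuracy conditioned on each group."""
    totals, correct = defaultdict(int), defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        correct[group] += int(ok)
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {g: correct[g] / totals[g] for g in totals}
    return overall, per_group

overall, per_group = accuracy_by_group(results)
print(f"overall accuracy: {overall:.2f}")
for topic, acc in per_group.items():
    print(f"P(correct | topic={topic}) = {acc:.2f}")
```

With this made-up data, a "70% accurate" model is really a 90% model on one topic and a coin flip on the other.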

3. Confidence is not the same as probability

One of the most commonly used functions in machine learning is the Softmax function. We often interpret Softmax outputs as confidence scores: “If the model outputs 0.90 for cat, it is 90% sure.” But this interpretation can be misleading.

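Step back for a second and look at what Softmax does. It maps a vector of logits $z$ to probabilities via $\mathrm{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. Because of the exponential term, differences between logits are amplified: a gap of $d$ between two logits becomes a ratio of $e^{d}$ between their probabilities.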

So, a model can appear highly confident not because it “knows” something, but because one logit happened to be a few units larger than the others and the exponential operation turned that gap into a dominant probability.
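The effect is easy to see numerically. In this minimal sketch the logits are hypothetical; each row keeps the same ranking of classes and only stretches the gaps between them.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes; only the gaps between them matter.
for logits in ([2.0, 1.0, 0.0], [4.0, 2.0, 0.0], [8.0, 4.0, 0.0]):
    probs = softmax(logits)
    print(logits, "->", [round(p, 3) for p in probs])
```

The ordering of the classes never changes across the three rows, yet the top probability climbs from roughly two thirds to nearly one. The "confidence" reflects the scale of the logits, not additional evidence.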

So when ChatGPT predicts the next word, what it is essentially doing is answering:

“Of all possible tokens, after Softmax, which one is most likely?”

This creates what I think of as the “confident fool” problem: a system confidently asserting something incorrect because it has not learned how to express uncertainty.
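One common way to quantify this gap is the expected calibration error (ECE), which compares stated confidence with observed accuracy inside confidence bins. The sketch below is a simplified version with equal-width bins, and the predictions are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy across equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# Hypothetical predictions: the model reports 0.9 confidence but is right 6 times out of 10.
confidences = [0.9] * 10
correct = [True] * 6 + [False] * 4
print(f"ECE = {expected_calibration_error(confidences, correct):.2f}")
```

A well-calibrated system would score close to zero here; the "confident fool" in this toy example is off by about 30 percentage points.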

4. The Law of Large Numbers and why more data does not automatically mean more truth

The Law of Large Numbers states that as sample sizes increase, observed averages approach their expected values. This idea often motivates the use of extremely large datasets to train our models. After all, if a model sees enough examples, eventually it should learn the truth, right?

At first glance, this sounds reasonable, mainly because that is how we learn! But there is an important assumption hidden in the Law of Large Numbers: the samples must keep coming from the same underlying distribution, so that distribution has to remain stable.

Human knowledge and language are not stable distributions. They change continuously and contain contradictions, biases, and inaccuracies. Spoken language varies from one area to another. Even within the same city, people use the same language, the same expressions, and the same words differently.

As a result, the model does not necessarily converge toward “truth.” Instead, it converges toward dominant patterns. So, if a misconception appears frequently enough in the data, the model may learn it because, statistically, it becomes the most probable continuation.
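A toy simulation shows what the Law of Large Numbers actually delivers in this situation. The corpus here is hypothetical: a misconception is assumed to appear in 70% of statements and the correct claim in 30%. More samples make the estimate more stable, but what it stabilizes on is the data's dominant pattern.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical corpus: a popular misconception outnumbers the correct statement 7 to 3.
TRUE_ANSWER = "correct claim"
MISCONCEPTION = "popular misconception"

def most_common_answer(n_samples: int, p_misconception: float = 0.7) -> tuple[str, float]:
    """Draw n_samples statements and return the most frequent one with its share."""
    draws = [
        MISCONCEPTION if random.random() < p_misconception else TRUE_ANSWER
        for _ in range(n_samples)
    ]
    answer, count = Counter(draws).most_common(1)[0]
    return answer, count / n_samples

for n in (10, 1_000, 100_000):
    answer, share = most_common_answer(n)
    print(f"n={n:>7}: most frequent = {answer!r} (share {share:.3f})")
```

As the sample grows, the share settles near 0.7 and the winner is the misconception. The averages converge exactly as promised, just not toward the truth.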

5. Stochasticity is not necessarily creativity

People often describe AI systems as “creative” when they produce surprising outputs. However, from a probabilistic perspective, something else may be happening.

Temperature changes the likelihood that the model selects less probable tokens. Samples drawn at low temperature are predictable and safe. Those drawn at high temperature tend to be more diverse and surprising, which often comes with a greater risk of hallucination.

So, increasing the temperature effectively flattens the probability distribution, which means lower-probability outcomes will be sampled more frequently. What we sometimes interpret as creativity may instead be the model's exploration of less likely regions of the distribution.
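The flattening can be checked directly. This is a minimal sketch using the usual formulation in which logits are divided by the temperature before Softmax; the logits themselves are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature: float):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs) -> float:
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [3.0, 1.5, 0.5, 0.0]  # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top prob={probs[0]:.3f}, tail mass={1 - probs[0]:.3f}, entropy={entropy(probs):.3f}")
```

As the temperature rises, the top token loses probability, the tail gains it, and the entropy goes up. Nothing about the model's knowledge changed; only the sampling did.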

6. Moving from possible to reliable

If our goal is to build AI systems that consistently work in real environments, we need to move beyond asking whether something is possible and focus on reliability. Again, easier said than done. Still, some useful approaches include:

1- Using techniques such as Platt Scaling and Isotonic Regression to help align confidence scores with observed performance.

2- Using methods such as Bayesian neural networks or Monte Carlo Dropout to help quantify what a model does not know.

3- Using external validation methods to enforce output structure and requirements, rather than assuming the model will naturally follow rules.
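To illustrate the third approach, here is a minimal sketch of external validation with retries. Everything specific in it is an assumption made for illustration: the output contract in `REQUIRED_FIELDS`, the helper names, and the stub that stands in for a real model call.

```python
import json

REQUIRED_FIELDS = {"answer": str, "confidence": float}  # hypothetical output contract

def validate_output(raw: str) -> dict:
    """Parse raw model text as JSON and check it against REQUIRED_FIELDS."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return data

def generate_validated(generate, prompt: str, max_attempts: int = 3) -> dict:
    """Call `generate` (any callable returning text) until its output validates."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate_output(generate(prompt))
        except ValueError as exc:
            last_error = exc  # remember why the attempt failed, then retry
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")

# Stub standing in for a real model call: fails once, then returns valid JSON.
replies = iter(["Sure! Here is the answer...", '{"answer": "42", "confidence": 0.8}'])
result = generate_validated(lambda prompt: next(replies), "What is 6 * 7?")
print(result)
```

The design choice is that the rules live outside the model: the output either passes an explicit check or it is rejected, regardless of how confident the text sounds.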

Final Thoughts

A few years ago, everyone was impressed by AI systems that simply predicted the next word. Now we are discovering that predicting the next word is only part of the problem.

The harder challenge is predicting the right word repeatedly and reliably, especially with new, impressive models popping up every day, along with many promises of great performance. So, next time you see an impressive AI demo, I encourage you to ask (yourself or the person presenting the model):

“Is this what the model typically does, or is this a particularly lucky sample?”

In a world with nearly infinite possibilities, almost anything can happen. Engineering, however, is rarely about what can happen. It is about what you can trust to happen again.