This is a story about a failure that turned into something interesting.
For months, I — along with hundreds of others — have tried to build a neural network that could learn to detect when AI systems hallucinate — when they confidently generate plausible-sounding nonsense instead of actually engaging with the information they were given. The idea is straightforward: train a model to recognize the subtle signatures of fabrication in how language models respond.
But it didn’t work. The learned detectors I designed collapsed. They found shortcuts. They failed on any distribution even slightly different from the training data. Every approach I tried hit the same wall.
So I gave up on “learning” and started asking a different question: why not turn this into a geometry problem? That is what I did.
Backing Up
Before I get into the geometry, let me explain what we’re dealing with. Because “hallucination” has become one of those terms that means everything and nothing. Here’s the specific situation. You have a Retrieval-Augmented Generation system — a RAG system. When you ask it a question, it first retrieves relevant documents from some knowledge base. Then it generates a response that’s supposed to be grounded in those documents.
- The promise: answers backed by sources.
- The reality: sometimes the model ignores the sources entirely and generates something that sounds reasonable but has nothing to do with the retrieved content.
This matters because the whole point of RAG is trustworthiness. If you wanted creative improvisation, you wouldn’t bother with retrieval. You’re paying the computational and latency cost of retrieval specifically because you want grounded answers.
So: can we tell when grounding failed?
Sentences on a Sphere
LLMs represent text as vectors. A sentence becomes a point in high-dimensional space — 768 dimensions for early models like BERT, though the specific number doesn’t matter much (DeepSeek-V3 and R1 use an embedding size of 7,168). Sentence-embedding vectors are typically normalized to unit length: every sentence, regardless of length or complexity, gets projected onto a unit sphere.
Once we adopt this picture, we can work with angles and distances on the sphere. For example, we expect similar sentences to cluster together: “The cat sat on the mat” and “A feline rested on the rug” end up near each other, while unrelated sentences end up far apart. This clustering is exactly what embedding models are trained to produce.
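To make this concrete, here is a minimal sketch using the sentence-transformers library. The model name is just one convenient choice (nothing here depends on it), and the third sentence is an arbitrary unrelated example.

```python
# Minimal sketch of "sentences on a sphere": encode a paraphrase pair and an
# unrelated sentence, then compare angular distances on the unit sphere.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

def angle(u: np.ndarray, v: np.ndarray) -> float:
    """Angular distance in radians between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "Quarterly revenue grew by twelve percent",  # unrelated on purpose
]
# normalize_embeddings=True projects every sentence onto the unit sphere
emb = model.encode(sentences, normalize_embeddings=True)

print("paraphrase pair:", angle(emb[0], emb[1]))  # small angle
print("unrelated pair: ", angle(emb[0], emb[2]))  # noticeably larger angle
```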
So now consider what happens in RAG. We have three pieces of text (Figure 1):
- The question, q (one point on the sphere)
- The retrieved context, c (another point)
- The generated response, r (a third point)
Three points on a sphere form a triangle. And triangles have geometry (Figure 2).
The Laziness Hypothesis
When a model uses the retrieved context, what should happen? The response should depart from the question and move toward the context. It should pick up the vocabulary, framing, and concepts from the source material. Geometrically, this means the response should end up closer to the context than to the question (Figure 1).
But when a model hallucinates — when it ignores the context and generates something from its own parametric knowledge — the response stays in the question’s neighborhood. It continues the question’s semantic framing without venturing into unfamiliar territory. I called this semantic laziness. The response doesn’t travel. It stays home. Figure 1 illustrates the laziness signature: question q, context c, and response r form a triangle on the unit sphere. A grounded response ventures toward the context; a hallucinated one stays home near the question. The geometry is high-dimensional, but the intuition is spatial: did the response actually go anywhere?
Semantic Grounding Index
To measure this, I defined a ratio of two angular distances:

SGI = θ(q, r) / θ(c, r)

where θ(·, ·) is the angle between two points on the sphere. I called it the Semantic Grounding Index, or SGI.

If SGI is greater than 1, the response departed toward the context. If SGI is less than 1, the response stayed close to the question: the model never found its way out into the answer space and fell back on what felt safe (a kind of safety state). The SGI is just two angles and a division. No neural networks, no learned parameters, no training data. Pure geometry.
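In code, the whole metric fits in a few lines. This is a sketch of the ratio as described above, not the paper’s reference implementation; the embedding model and the example strings are placeholders of mine.

```python
# Sketch of SGI as described in this post:
#   SGI = theta(question, response) / theta(context, response)
# Placeholder embedding model and toy strings; not the paper's reference code.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def angle(u, v):
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def sgi(question: str, context: str, response: str) -> float:
    q, c, r = model.encode([question, context, response], normalize_embeddings=True)
    return angle(q, r) / angle(c, r)  # > 1: moved toward context; < 1: stayed home

question = "What did the 2023 audit conclude about the pension fund?"
context = "The 2023 audit concluded the pension fund was underfunded by 14 percent."
grounded = "According to the audit, the fund was underfunded by roughly 14 percent."
lazy = "Pension funds are reviewed every year to check their overall health."

print("grounded:", sgi(question, context, grounded))  # expected to be higher
print("lazy:    ", sgi(question, context, lazy))      # expected to be lower
```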

Does It Actually Work?
Simple ideas need empirical validation. I ran this on 5,000 samples from HaluEval, a benchmark where we know ground truth — which responses are genuine and which are hallucinated.

I ran the same analysis with five completely different embedding models. Different architectures, different training procedures, different organizations — Sentence-Transformers, Microsoft, Alibaba, BAAI. If the signal were an artifact of one particular embedding space, these models would disagree. They didn’t. The average correlation across models was r = 0.85 (ranging from 0.80 to 0.95).
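The agreement check itself is nothing exotic. Here is a sketch, assuming you already have per-sample SGI scores from each embedding model plus HaluEval’s hallucination labels, and assuming the average is taken over pairwise model comparisons.

```python
# Sketch: cross-model agreement and per-model discrimination.
# `scores` maps a model name to per-sample SGI values (same sample order);
# `labels` marks hallucinated responses as 1 and genuine ones as 0.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

def agreement_and_auc(scores: dict[str, np.ndarray], labels: np.ndarray):
    # Pairwise Pearson correlation between the models' SGI scores
    corrs = [pearsonr(scores[a], scores[b])[0] for a, b in combinations(scores, 2)]
    # Hallucinated responses should get *lower* SGI, so rank by -SGI
    aucs = {name: roc_auc_score(labels, -s) for name, s in scores.items()}
    return float(np.mean(corrs)), aucs
```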

When the Math Predicted Something
Up to this point, I had a useful heuristic. Useful heuristics are fine. But what happened next turned a heuristic into something more principled. The triangle inequality. You probably remember this from school: the sum of any two sides of a triangle must be greater than the third side. This constraint applies on spheres too, though the formula looks slightly different.

If the question and context are very close together — semantically similar — then there isn’t much “room” for the response to differentiate between them. The geometry forces the angles to be similar regardless of response quality. SGI values get squeezed toward 1. But when the question and context are far apart on the sphere? Now there’s geometric space for divergence. Valid responses can clearly depart toward the context. Lazy responses can clearly stay home. The triangle inequality loosens its grip.
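Here is the bound worked out, as a back-of-the-envelope derivation rather than the paper’s exact statement. Write θ(·, ·) for angular distance and take SGI = θ(q, r) / θ(c, r) as above. The triangle inequality for geodesic distance on the sphere gives

θ(c, r) − θ(q, c) ≤ θ(q, r) ≤ θ(c, r) + θ(q, c)

and dividing through by θ(c, r):

1 − θ(q, c)/θ(c, r) ≤ SGI ≤ 1 + θ(q, c)/θ(c, r)

When θ(q, c) is small, SGI is pinned near 1 no matter what the response did. When θ(q, c) is large, the bounds widen and grounded and lazy responses have room to separate.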
This implies a prediction:
SGI’s discriminative power should increase as question-context separation increases.
The results confirm it: discriminative power increases monotonically with question-context separation, exactly as the triangle inequality predicted.
| Question-Context Separation | Effect Size (d) | AUC |
| --- | --- | --- |
| Low (similar) | 0.61 | 0.72 |
| Medium | 0.90 | 0.77 |
| High (different) | 1.27 | 0.83 |
This distinction carries epistemic weight. Observing a pattern in data after the fact is weak evidence: it may reflect noise or analyst degrees of freedom rather than genuine structure. The stronger test is to predict, from basic principles, what should happen before examining the data. The triangle inequality implied a specific relationship between θ(q,c) and discriminative power. The empirical results confirmed it.
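The stratified analysis is also easy to sketch. This is an illustration rather than the paper’s exact procedure: split samples into tertiles of θ(q, c), then measure effect size and AUC inside each bin.

```python
# Sketch: stratify by question-context separation theta(q, c), then check how
# well SGI separates genuine from hallucinated responses within each stratum.
import numpy as np
from sklearn.metrics import roc_auc_score

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return float((a.mean() - b.mean()) / pooled)

def stratified_report(theta_qc, sgi, hallucinated, n_bins=3):
    theta_qc, sgi, hallucinated = map(np.asarray, (theta_qc, sgi, hallucinated))
    edges = np.quantile(theta_qc, np.linspace(0, 1, n_bins + 1))
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (theta_qc >= lo) & (theta_qc <= hi)
        genuine = sgi[mask & (hallucinated == 0)]
        hallu = sgi[mask & (hallucinated == 1)]
        d = cohens_d(genuine, hallu)                    # genuine minus hallucinated
        auc = roc_auc_score(hallucinated[mask], -sgi[mask])
        print(f"theta(q,c) in [{lo:.2f}, {hi:.2f}]: d = {d:.2f}, AUC = {auc:.2f}")
```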
Where It Doesn’t Work
TruthfulQA is a benchmark designed to test factual accuracy. Questions like “What causes the seasons?” with correct answers (“Earth’s axial tilt”) and common misconceptions (“Distance from the Sun”). I ran SGI on TruthfulQA. The result: AUC = 0.478. Slightly worse than random guessing.
Angular geometry captures topical similarity. “The seasons are caused by axial tilt” and “The seasons are caused by solar distance” are about the same topic. They occupy nearby regions on the semantic sphere. One is true and one is false, but they’re both responses that engage with the astronomical content of the question.
SGI detects whether a response departed toward its sources. It cannot detect whether the response got the facts right. These are fundamentally different failure modes. It’s a scope boundary. And knowing your scope boundaries is arguably more important than knowing where your method works.
What This Means Practically
If you’re building RAG systems, SGI correctly ranks hallucinated responses below valid ones about 80% of the time — without any training or fine-tuning.
- If your retrieval system returns documents that are semantically very close to the questions, SGI will have limited discriminative power. Not because it’s broken, but because the geometry doesn’t permit differentiation. Consider whether your retrieval is actually adding information or just echoing the query (there’s a sketch of this check after the list).
- Effect sizes roughly doubled for long-form responses compared to short ones. This is precisely where human verification is most expensive — reading a five-paragraph response takes time. Automated flagging is most valuable exactly where SGI works best.
- SGI detects disengagement. Natural language inference detects contradiction. Uncertainty quantification detects model confidence. These measure different things. A response can be topically engaged but logically inconsistent, or confidently wrong, or lazily correct by accident. Defense in depth.
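On the first point, the check is cheap, because you already have embeddings for the query and the retrieved text. A sketch, with a placeholder threshold you would calibrate on your own data:

```python
# Sketch: flag retrievals that merely echo the query. The 0.3-radian threshold
# is a placeholder to calibrate on your own data, not a value from the paper.
import numpy as np

def retrieval_adds_information(q_emb: np.ndarray, c_emb: np.ndarray,
                               min_angle: float = 0.3) -> bool:
    """q_emb and c_emb are unit-norm embeddings of the question and the context."""
    theta_qc = float(np.arccos(np.clip(np.dot(q_emb, c_emb), -1.0, 1.0)))
    return theta_qc >= min_angle  # below this, SGI has little room to move
```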
The Scientific Question
I have a hypothesis about why semantic laziness happens. I want to be honest that it’s speculation — I haven’t proven the causal mechanism.
Language models are autoregressive predictors. They generate text token by token, each choice conditioned on everything before. The question provides strong conditioning — familiar vocabulary, established framing, a semantic neighborhood the model knows well.
The retrieved context represents a departure from that neighborhood. Using it well requires confident bridging: taking concepts from one semantic region and integrating them into a response that started in another region.
When an LLM is uncertain about how to bridge, the path of least resistance is to stay home. The model generates something fluent that continues the question’s framing without venturing into unfamiliar territory, because that is statistically safe. The result is semantic laziness.
If this is right, SGI should correlate with internal model uncertainty — attention patterns, logit entropy, that sort of thing. Low-SGI responses should show signatures of hesitation. That’s a future experiment.
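If anyone wants to run that experiment, the uncertainty side is easy to instrument. Here is a sketch of per-response mean token entropy with a small Hugging Face causal LM; the model is a stand-in, and the real experiment would score each generated response conditioned on its prompt, then correlate these values with SGI.

```python
# Sketch: mean entropy of a causal LM's next-token distributions over a text.
# "gpt2" is a stand-in model; pair these values with SGI scores and correlate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def mean_token_entropy(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    logits = lm(ids).logits[0, :-1]                 # distribution for each next token
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return float(entropy.mean())
```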
Takeaways
- First: simple geometry can reveal structure that complex learned systems miss. I spent months trying to train hallucination detectors. The thing that worked was two angles and a division. Sometimes the right abstraction is the one that exposes the phenomenon most directly, not the one with the most parameters.
- Second: predictions matter more than observations. Finding a pattern is easy. Deriving what pattern should exist from first principles, then confirming it — that’s how you know you’re measuring something real. The stratified analysis wasn’t the most impressive number in this work, but it was the most important.
- Third: boundaries are features, not bugs. SGI fails completely on TruthfulQA. That failure taught me more about what the metric actually measures than the successes did. Any tool that claims to work everywhere probably works nowhere reliably.
Honest Conclusion
I’m not sure if semantic laziness is a deep truth about how language models fail, or just a useful approximation that happens to work for current architectures. The history of machine learning is littered with insights that seemed fundamental and turned out to be contingent.
But for now, we have a geometric signature of disengagement: a practical hallucination detector, at least within the scope boundaries above. It’s consistent across embedding models. It’s predictable from mathematical first principles. And it’s cheap to compute.
That feels like progress.

Note: The scientific paper with complete methodology, statistical analyses, and reproducibility details is available at https://arxiv.org/abs/2512.13771.
You can cite this work in BibTeX as:
@misc{marín2025semanticgroundingindexgeometric,
  title={Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems},
  author={Javier Marín},
  year={2025},
  eprint={2512.13771},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.13771},
}
Javier Marín is an independent AI researcher based in Madrid, working on reliability assessment for production AI systems. He tries to be honest about what he doesn’t know. You can contact Javier at [email protected]. Any contribution is welcome!