Why Do Language Models Hallucinate?




Image by Editor | ChatGPT

 

Introduction

 
Hallucinations, the bane of the language model (LM) and its users, are the plausible-sounding but factually incorrect statements produced by LMs. They are problematic because they erode user trust, propagate misinformation, and mislead downstream decisions, even when the output is delivered with high confidence. They are especially troublesome in scenarios where users can’t easily verify claims (technical answers, medical or legal summaries, data analysis), because confident delivery of incorrect information masks the underlying uncertainty and turns small modeling errors into potential high-stakes failures.

A recent paper, “Why Language Models Hallucinate” by Kalai, Nachum, Vempala, and Zhang, has taken on the task of analyzing both the statistical roots of these errors and the socio-technical incentives that keep them alive. The authors connect generative mistakes to simple classification dynamics and examine how today’s training and evaluation practices nudge models toward confident guessing rather than calibrated uncertainty. The result is a firm understanding of where hallucinations actually come from and what kinds of changes might reduce them in practice.

The paper provides several high-level and insightful revelations regarding the causes and persistence of LM hallucinations, and we are going to look at five of these.

 

1. The Root Cause of Hallucinations

 
TL;DR: Hallucinations are primarily caused by training and evaluation procedures that reward guessing over admitting uncertainty.

The core argument of the paper is that hallucinations, defined as plausible yet incorrect statements, persist because the procedures used for training and evaluation inadvertently reward confident guessing rather than the acknowledgment of uncertainty. LMs are optimized to function as “good test-takers,” meaning they guess when unsure to maximize their score under grading schemes that penalize uncertain responses (such as “I don’t know” or IDK). Under a common binary 0-1 scoring scheme, guessing when uncertain maximizes the expected score.
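To make that incentive concrete, here is a minimal sketch of the expected-value arithmetic for a single question. The 30% confidence value and the penalty scheme are illustrative assumptions, not figures from the paper.

```python
def expected_score(p_correct: float, abstain: bool, wrong_penalty: float = 0.0) -> float:
    """Expected score on one question.

    p_correct:     model's probability that its best guess is right
    abstain:       if True, respond "I don't know" (scores 0)
    wrong_penalty: points deducted for an incorrect answer (0 under binary 0-1 grading)
    """
    if abstain:
        return 0.0
    # Guessing: gain 1 point with probability p_correct, lose wrong_penalty otherwise.
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


p = 0.3  # the model is only 30% sure of its best guess

# Under binary 0-1 grading, guessing beats abstaining whenever p > 0.
print(expected_score(p, abstain=False))                     # 0.3
print(expected_score(p, abstain=True))                      # 0.0

# With an explicit penalty for confident errors, abstaining wins at low confidence.
print(expected_score(p, abstain=False, wrong_penalty=1.0))  # -0.4
```

The same arithmetic explains why a grading rubric needs an explicit cost for confident errors (or credit for abstention) before “I don’t know” ever becomes the score-maximizing response.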

 

Proposed prompt to mitigate ‘confident guessing’ and encourage ‘the acknowledgment of uncertainty’
Image by Author | Gemini

 

2. The Origins of Hallucinations

 
TL;DR: The statistical origin of hallucinations is reducible to simple errors in binary classification.

The paper demystifies hallucinations by showing that they originate simply as errors in binary classification. The analysis connects generative errors (like hallucinations) to a supervised learning problem the authors call “Is-It-Valid” (IIV) binary classification: deciding whether a candidate output is valid or erroneous. The statistical objective minimized during pretraining (cross-entropy loss) naturally leads to generative errors whenever the system cannot statistically distinguish incorrect statements from facts. The analysis yields a mathematical relationship: the generative error rate is, roughly, at least twice the IIV misclassification rate.
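The sketch below is a toy numerical illustration of that reduction, not the paper’s formal construction; the statement names, probabilities, and the 0.1 threshold are made-up assumptions.

```python
# Toy illustration of the Is-It-Valid (IIV) reduction; all numbers are invented.
model_probs = {
    # statement: (probability under the model, is it actually true?)
    "fact_a": (0.30, True),
    "fact_b": (0.30, True),
    "fact_c": (0.08, True),   # a rare fact the model is unsure about
    "fake_1": (0.08, False),  # fabrications the model cannot statistically
    "fake_2": (0.08, False),  # distinguish from fact_c: identical probability
    "fake_3": (0.08, False),
    "fake_4": (0.08, False),
}

# Generative error rate: probability mass the model places on false statements.
gen_error = sum(p for p, is_true in model_probs.values() if not is_true)

# IIV classifier induced by the model: call a statement "valid" if the model's
# probability exceeds a threshold (0.1 here, an arbitrary choice for this toy).
threshold = 0.1
valid_probs   = [p for p, is_true in model_probs.values() if is_true]
invalid_probs = [p for p, is_true in model_probs.values() if not is_true]
miss_valid   = sum(p <= threshold for p in valid_probs) / len(valid_probs)     # valid called invalid
miss_invalid = sum(p > threshold for p in invalid_probs) / len(invalid_probs)  # invalid called valid
iiv_error = 0.5 * (miss_valid + miss_invalid)  # error on a 50/50 valid/invalid mix

print(f"generative error rate:      {gen_error:.2f}")  # 0.32
print(f"IIV misclassification rate: {iiv_error:.2f}")  # 0.17
# Here the generative error rate lands near twice the IIV error, echoing
# (though not formally reproducing) the paper's roughly-factor-of-two bound.
```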

 

Misclassifying statements as ‘valid’ leads to hallucinations
Image by Author | Gemini

 

3. Hallucinations are Inevitable

 
TL;DR: Calibrated base models are mathematically compelled to hallucinate, even with error-free training data.

The paper shows that even if the training corpus were perfect and error-free, the process of minimizing the statistical objective during pretraining would still lead the language model to generate errors. This is linked to the concept of calibration. Since errors are a natural consequence of the standard cross-entropy objective, any well-trained base model that is calibrated (meaning its predicted probabilities align with reality) must inevitably generate errors, particularly when faced with inherently unlearnable facts. Conversely, a base model that avoids errors must necessarily be miscalibrated (i.e., its uncertainty estimates must be wrong).
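Here is a minimal simulation of that tension, under the assumption of 1,000 “arbitrary” facts (birthdays, say) for which the training data offers no signal distinguishing the true answer from one equally plausible alternative; the setup and numbers are hypothetical.

```python
import random

random.seed(0)
n_facts = 1000
# Each fact has a true answer that nothing in the training data distinguishes
# from one equally plausible alternative.
truths = [random.choice(["option_a", "option_b"]) for _ in range(n_facts)]

def evaluate(predict):
    """Return (error rate, average stated confidence) over all facts."""
    errors, confidence = 0, 0.0
    for truth in truths:
        answer, stated_confidence = predict(truth)
        errors += (answer != truth)
        confidence += stated_confidence
    return errors / n_facts, confidence / n_facts

def calibrated(truth):
    # No learnable signal: state 50% confidence and pick at random.
    return random.choice(["option_a", "option_b"]), 0.5

def never_wrong(truth):
    # Always right, with 100% confidence -- possible only by fiat here, since no
    # training signal could justify that certainty; such a model is miscalibrated.
    return truth, 1.0

err_c, conf_c = evaluate(calibrated)
err_n, conf_n = evaluate(never_wrong)
print(f"calibrated model: error ~{err_c:.2f}, stated confidence {conf_c:.2f}")  # ~0.50 vs 0.50
print(f"error-free model: error  {err_n:.2f}, stated confidence {conf_n:.2f}")  #  0.00 vs 1.00
# The calibrated model's confidence matches its accuracy but it errs about half
# the time; the error-free model's certainty cannot be learned from the data.
```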

 

4. Hallucinations are Persistent

 
TL;DR: The persistence of hallucinations is driven by an “epidemic” of misaligned primary evaluations.

Although post-training techniques often aim to reduce falsehoods, hallucinations persist because the vast majority of existing, influential benchmarks and leaderboards rely on binary grading systems (such as accuracy or pass rate) that penalize abstention and uncertainty. This creates a “socio-technical” problem: if Model A correctly signals uncertainty but Model B always guesses when unsure, Model B will outperform Model A under 0-1 scoring, reinforcing the guessing behavior that produces hallucinations. This dominance of misaligned evaluations is the root problem, and it cannot be solved simply by adding a small fraction of new hallucination-specific evaluations.
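A small simulation makes the leaderboard dynamic concrete. The 60/40 split, the 25% lucky-guess rate, and the -1 penalty are illustrative assumptions, not figures from the paper.

```python
import random

random.seed(42)
N = 1000
# Both models truly know 60% of the answers; the rest are unknown to them.
questions = ["known" if random.random() < 0.6 else "unknown" for _ in range(N)]

def grade(outcomes, wrong_penalty=0.0):
    """Average score: +1 for correct, 0 for abstain, -wrong_penalty for incorrect."""
    score = sum(1.0 if o == "correct" else (-wrong_penalty if o == "wrong" else 0.0)
                for o in outcomes)
    return score / len(outcomes)

def model_a(q):
    # Signals uncertainty: abstains whenever it does not know.
    return "correct" if q == "known" else "abstain"

def model_b(q):
    # Always guesses; a blind guess is lucky 25% of the time.
    if q == "known":
        return "correct"
    return "correct" if random.random() < 0.25 else "wrong"

outcomes_a = [model_a(q) for q in questions]
outcomes_b = [model_b(q) for q in questions]

# Under binary 0-1 accuracy the guesser tops the leaderboard (~0.70 vs ~0.60)...
print("0-1 accuracy    A:", round(grade(outcomes_a), 2), " B:", round(grade(outcomes_b), 2))
# ...but with a -1 penalty for confident errors, signalling uncertainty wins (~0.60 vs ~0.40).
print("penalized (-1)  A:", round(grade(outcomes_a, 1.0), 2), " B:", round(grade(outcomes_b, 1.0), 2))
```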

 

5. The Role of Arbitrariness

 
TL;DR: Statistical uncertainty arising from arbitrary facts (low data frequency) is a key driver of pretraining errors.

One major statistical factor contributing to pretraining errors is the existence of arbitrary facts: specific, essentially random facts (individual birthdays, for example) for which no succinct pattern explains the target function, producing epistemic uncertainty because the necessary knowledge is absent or rare in the training data. The analysis shows that for arbitrary facts, the expected hallucination rate is lower-bounded by the singleton rate, the fraction of facts appearing exactly once in the training data. For example, if 20% of birthday facts appear only once, models are expected to hallucinate on at least 20% of those facts. Other generative error factors include poor models (where the model family cannot represent the concept well, as in the letter-counting example) and GIGO (garbage in, garbage out: models replicating errors from the training data).
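As a quick illustration of the singleton-rate bound, the sketch below counts how many facts appear exactly once in a made-up list of birthday facts; the names, dates, and the resulting 71% figure are purely hypothetical.

```python
from collections import Counter

# Pretend each element is one birthday fact extracted from the training corpus.
training_facts = [
    "alice:1990-03-14", "bob:1985-07-02", "bob:1985-07-02",
    "carol:1979-11-30", "dana:2001-01-09", "erin:1993-06-21",
    "erin:1993-06-21", "erin:1993-06-21", "frank:1988-12-05",
    "grace:1995-04-17",
]

counts = Counter(training_facts)
singletons = [fact for fact, c in counts.items() if c == 1]
singleton_rate = len(singletons) / len(counts)

print(f"distinct facts: {len(counts)}")         # 7
print(f"singletons:     {len(singletons)}")     # 5
print(f"singleton rate: {singleton_rate:.0%}")  # 71%
# Reading, per the paper's bound: if ~71% of these birthday facts are seen only
# once, a calibrated model is expected to hallucinate on at least ~71% of
# queries about such facts.
```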

 

Key Takeaways

 
A few themes tie the paper together.

First, hallucinations aren’t mystical failures; instead, they arise from ordinary misclassifications of validity, the same kind of binary errors any classifier makes when it can’t reliably tell true from false.

Second, our dominant evaluation culture implicitly rewards confident guessing by penalizing expressions of uncertainty, so models that never say “I don’t know” look better on leaderboards even when they’re wrong.

Third, durable progress won’t come from bolt-on patches; it requires changing benchmark scoring to value calibrated uncertainty and abstention, then aligning training and deployment to those incentives.

Something to ponder: what would your information consumption look like if you rewarded people, and machines, for knowing when not to answer?
 
 

Matthew Mayo (@mattmayo13) holds a master’s degree in computer science and a graduate diploma in data mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.