OpenAI says its research shows that the predominant source of hallucinations in large language models (LLMs) is the prevailing training and evaluation regime, which rewards models for guessing rather than for admitting uncertainty. The company advocates revising benchmark tests to make these AI systems more reliable and trustworthy.
Training flaws create confident fabrications
In its latest research paper, OpenAI notes that despite ongoing advances, LLMs continue to hallucinate and can mislead users. The errors arise because pre-training teaches models the statistical patterns of language rather than how to verify facts, so when the required knowledge is missing they still generate fluent, confident, but fabricated responses.
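To see why likelihood training alone produces this behavior, consider the following illustrative Python sketch. It is not code from OpenAI's paper, and the question, candidate answers, and probabilities are invented: a model that only ranks continuations by how plausible they sound will almost always emit a confident-looking answer rather than an admission of uncertainty.

```python
# Toy sketch (not OpenAI's code): a pretrained LM scores continuations purely by
# likelihood, so for a question whose answer it never memorized it still emits
# a plausible-sounding string rather than an uncertainty signal.
import random

# Hypothetical next-phrase distribution for "When was Jane Doe born?" when the
# fact was absent from training data: fluent, confident dates dominate.
candidate_continuations = {
    "March 4, 1962": 0.34,   # confident but unverifiable
    "July 19, 1958": 0.31,
    "November 2, 1970": 0.30,
    "I don't know": 0.05,    # rarely the most likely string in web text
}

def sample(dist):
    """Sample a continuation in proportion to its modeled probability."""
    phrases, weights = zip(*dist.items())
    return random.choices(phrases, weights=weights, k=1)[0]

print(sample(candidate_continuations))  # almost always a confident fabrication
```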
OpenAI also observes that current evaluation practice measures accuracy almost exclusively and overlooks whether a model can express uncertainty. This parallels a multiple-choice exam in which guessing can still earn points: models are pressured to offer an answer instead of acknowledging ignorance, and the overall effect is more errors rather than fewer.
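A back-of-the-envelope calculation makes the incentive concrete. The sketch below assumes a 25% chance that a guess happens to be right; the figure is illustrative, not from the paper. Under accuracy-only grading, any nonzero chance of being correct makes guessing score better than abstaining.

```python
# Minimal sketch (illustrative numbers, not from the paper): under an
# accuracy-only benchmark, guessing strictly dominates abstaining because
# "I don't know" always scores 0.
p_correct_if_guessing = 0.25  # assumed chance the guess happens to be right

expected_score_guess = p_correct_if_guessing * 1 + (1 - p_correct_if_guessing) * 0
expected_score_abstain = 0.0  # abstentions earn nothing under accuracy-only grading

print(expected_score_guess > expected_score_abstain)  # True -> guessing is incentivized
```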
GPT-5 Thinking Mini shows promise with uncertainty awareness
To demonstrate this, OpenAI compared the uncertainty-aware GPT-5 Thinking Mini (gpt-5-thinking-mini) with the o4-mini model on the SimpleQA benchmark. Although o4-mini achieved marginally higher accuracy, its error rate was roughly three times that of GPT-5 Thinking Mini, suggesting that attempts to boost accuracy through risky guessing produce substantially more mistakes.
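The trade-off becomes explicit when every response is graded as correct, incorrect, or abstained, and the three rates are reported separately. The sketch below uses illustrative grade counts chosen to mirror the pattern described above, not the benchmark's actual figures.

```python
# Sketch of how SimpleQA-style results can be decomposed (my framing, not
# OpenAI's code): accuracy and error rate are reported separately, so a model
# can trade a little accuracy for far fewer hallucinated answers.
from collections import Counter

def summarize(grades):
    """grades: list of 'correct' | 'incorrect' | 'abstain' labels."""
    counts = Counter(grades)
    total = len(grades)
    return {
        "accuracy": counts["correct"] / total,
        "error_rate": counts["incorrect"] / total,   # confident wrong answers
        "abstention_rate": counts["abstain"] / total,
    }

# Illustrative grade lists only: a cautious model vs. an always-guessing model.
cautious = ["correct"] * 22 + ["incorrect"] * 26 + ["abstain"] * 52
guesser  = ["correct"] * 24 + ["incorrect"] * 75 + ["abstain"] * 1

print(summarize(cautious))  # slightly lower accuracy, far fewer errors
print(summarize(guesser))   # marginally higher accuracy, roughly 3x the errors
```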
To mitigate hallucinations, OpenAI proposes modifying benchmarks by integrating negative scoring that heavily penalizes incorrect answers, alongside rewards for models that properly convey uncertainty or request clarifications. This shift aims to encourage safer and more reliable AI behavior.
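One way such a scheme could look in practice is a scoring rule in which a wrong answer costs more than an abstention, calibrated to a stated confidence threshold. The penalty form below is an assumption for illustration, not a rule taken verbatim from OpenAI's proposal.

```python
# Hedged sketch of the kind of scoring rule described: wrong answers are
# penalized more heavily than abstentions, so answering only pays off when
# the model's confidence exceeds a threshold. The exact penalty is assumed.
def score(response, is_correct, confidence_threshold=0.75):
    """+1 for a correct answer, 0 for abstaining, and a penalty for a wrong
    answer scaled so that guessing only pays off above the threshold."""
    if response == "i don't know":
        return 0.0
    if is_correct:
        return 1.0
    # Penalty t / (1 - t): expected value of answering is positive only when
    # the model's confidence exceeds confidence_threshold.
    t = confidence_threshold
    return -t / (1 - t)

# With a threshold of 0.75, a wrong answer costs 3 points, so a 60%-confident
# guess has negative expected value and abstaining becomes the better strategy.
print(score("March 4, 1962", is_correct=False))  # -3.0
print(score("i don't know", is_correct=False))   # 0.0
```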
Claude takes cautious approach but sacrifices usability
OpenAI also compared its models with Anthropic's Claude in 2024 and found that Claude is more likely to refuse hallucination-prone queries, declining to answer up to 70% of them. That caution reduces its usability in practical settings, and Claude's accuracy on the questions it does answer remains relatively low. The findings underscore the ongoing challenge of balancing accuracy, uncertainty, and usability in LLM development.
Article edited by Jerry Chen