
OpenAI says its latest models, o3 and o4-mini, are its most powerful yet. However, research shows the models also hallucinate more — at least twice as much as earlier models.
In the system card, the report that accompanies each new AI model, published alongside last week’s release, OpenAI reported that o4-mini is less accurate and hallucinates more than both o1 and o3. Using PersonQA, an internal test based on publicly available information, the company found that o4-mini hallucinated in 48% of responses, three times o1’s rate of roughly 16%.
Because o4-mini is smaller, cheaper, and faster than o3, it wasn’t expected to outperform it. But o3 still hallucinated in 33% of responses, twice o1’s rate. Of the three models, o3 scored the best on accuracy.
“o3 tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims,” OpenAI’s report explained. “More research is needed to understand the cause of this result.”
Hallucinations, in which a model fabricates claims, studies, and even URLs, continue to plague even the most cutting-edge AI systems. There is currently no perfect solution for preventing or identifying them, though OpenAI has tried some approaches.
Moreover, fact-checking is a moving target, which makes it hard to embed and scale. It also requires human cognitive skills that AI mostly lacks, such as common sense, discernment, and contextualization. As a result, how much a model hallucinates depends heavily on the quality of its training data (and on access to the internet for current information).
Minimizing false information in training data can lessen the chance of an untrue statement downstream. However, this technique doesn’t prevent hallucinations, as many of an AI chatbot’s creative choices are still not fully understood.
Overall, the risk of hallucination has tended to decline slowly with each new model release, which is what makes o3 and o4-mini’s scores somewhat unexpected. Though o3 gained 12 percentage points over o1 on accuracy, the fact that it hallucinates twice as often suggests its accuracy hasn’t grown in proportion to its capabilities.
Like other recent releases, o3 and o4-mini are reasoning models, meaning they show users the steps they take to work through a prompt. Last week, independent research lab Transluce published an evaluation finding that o3 often fabricates actions it cannot actually take in response to a request, such as claiming to run Python in a coding environment, even though the chatbot has no such ability.
What’s more, the model doubles down when caught. “[o3] further justifies hallucinated outputs when questioned by the user, even claiming that it uses an external MacBook Pro to perform computations and copies the outputs into ChatGPT,” the report explained. Transluce found that these false claims about running code were more frequent in o-series models (o1, o3-mini, and o3) than in GPT-series models (GPT-4.1 and GPT-4o).
This result is especially puzzling because reasoning models deliberately take longer in order to provide more thorough, higher-quality answers. Transluce co-founder Sarah Schwettmann even told TechCrunch that “o3’s hallucination rate may make it less useful than it otherwise would be.”
The report from Transluce said: “Although truthfulness issues from post-training are known to exist, they do not fully account for the increased severity of hallucination in reasoning models. We hypothesize that these issues might be intensified by specific design choices in o-series reasoning models, such as outcome-based reinforcement learning and the omission of chains-of-thought from previous turns.”
Last week, sources inside OpenAI and third-party testers confirmed the company has drastically scaled back safety testing for new models, including o3. While the system card shows that o3 and o4-mini are “approximately on par” with o1 for robustness against jailbreak attempts (all three score between 96% and 100%), these hallucination scores raise questions about the consequences of compressed testing timelines beyond safety itself.
The onus remains on users to fact-check any AI model’s output, a precaution that seems especially wise with this latest generation of reasoning models.