By OpenAI's own tests, its new reasoning models, o3 and o4-mini, hallucinate significantly more than o1.
First reported by TechCrunch, OpenAI's system card detailed the results of the PersonQA evaluation, which is designed to test for hallucinations. From those results, o3's hallucination rate is 33%, and o4-mini's hallucination rate is 48%, almost half the time. By comparison, o1's hallucination rate is 16%, meaning o3 hallucinated about twice as often.
The system card noted that o3 "tends to make more claims overall, leading to more accurate claims as well as more inaccurate/hallucinated claims." But OpenAI doesn't know the underlying cause, simply saying, "More research is needed to understand the cause of this result."
OpenAI's reasoning models are billed as more accurate than its non-reasoning models like GPT-4o and GPT-4.5 because they use more compute to "spend more time thinking before they respond," as described in the o1 announcement. Rather than largely relying on stochastic methods to provide an answer, the o-series models are trained to "refine their thinking process, try different strategies, and recognize their mistakes."
However, the system card for GPT-4.5, which was released in February, shows a 19% hallucination rate on the PersonQA evaluation. The same card also compares it to GPT-4o, which had a 30% hallucination rate.
Evaluation benchmarks are tricky. They can be subjective, especially if they're developed in-house, and research has found flaws in their datasets and even in how they evaluate models.
Plus, some rely on different benchmarks and methods to test accuracy and hallucinations. Hugging Face's hallucination benchmark evaluates models on the "occurrence of hallucinations in generated summaries" from around 1,000 public documents, and it found much lower hallucination rates across the board than OpenAI's evaluations did. GPT-4o scored 1.5%, GPT-4.5 preview scored 1.2%, and o3-mini-high with reasoning scored 0.8%. Notably, o3 and o4-mini were not included in the current leaderboard.
All of which is to say, even industry-standard benchmarks make it difficult to assess hallucination rates.
Then there's the added complexity that models tend to be more accurate when they can tap into web search to source their answers. But in order to use ChatGPT search, OpenAI shares data with third-party search providers, and enterprise customers using OpenAI models internally may not be willing to expose their prompts to that.
Either way, if OpenAI itself says its brand-new o3 and o4-mini models hallucinate more than its non-reasoning models, that could be a problem for its users. Mashable reached out to OpenAI and will update this story with a response.