A discrepancy between first-party and third-party benchmark results for OpenAI's o3 model is raising questions about the company's transparency and model testing practices.
When OpenAI unveiled o3 in December, the company said the model could answer a little over a quarter of the questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next-best model managed to answer only about 2% of FrontierMath problems correctly.
"Today, all offerings out there have less than 2% [on FrontierMath]," said Mark Chen, chief research officer at OpenAI, during a livestream. "We're seeing [internally], with o3 in aggressive test-time compute settings, we're able to get over 25%."
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more compute behind it than the model OpenAI launched publicly last week.
Epoch AI, the research institute behind FrontierMath, published the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI's highest claimed score.
OpenAI released o3, their long-awaited reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.
We evaluated the new models on our suite of math and science benchmarks. Results in the thread! pic.twitter.com/5gbtzkey1b
– Epoch AI (@EpochAIResearch) April 18, 2025
That's not to say OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI's, and that it used an updated release of FrontierMath for its evaluations.
"The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [compute], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private)," wrote Epoch.
According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model "is a different model [...] tuned for chat/product use," corroborating Epoch's report.
"All released o3 compute tiers are smaller than the version we [benchmarked]," wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.
Granted, the fact that o3's public release falls short of OpenAI's testing promises is a bit of a moot point, since the company's o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.
Still, it's another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.
Benchmarking "controversies" are becoming a common occurrence in the AI industry as vendors race to grab headlines and mindshare with new models.
In January, Epoch was criticized for waiting until after o3's announcement to disclose funding it had received from OpenAI. Many academics who contributed to FrontierMath weren't informed of OpenAI's involvement until it was made public.
More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.