It is difficult to choose the best AI to help you in work and life. Should you use OpenAI's GPT-4o, 4.5, 4.1, o1, o1-pro, o3-mini, or o3-mini-high? If OpenAI isn't for you, you could opt for one of the many models released by Meta, Google, or Anthropic.
This year has already seen at least a dozen model releases from major AI companies, and it can be confusing to work out which one really has a competitive edge. The developers of most of these releases claimed their AI achieved superior "benchmark" results in some way.
But this way of comparing models has faced concerns that it is neither rigorous nor reliable.
Earlier this month, Meta released two new models in its Llama family, which it said delivered "better results" than comparably sized models from Google and Mistral. However, Meta then faced accusations that it had gamed a benchmark.
LMArena, an AI benchmark that crowdsources users' votes on model performance, said Meta "should have been clearer" that it had submitted a version of Llama 4 Maverick that had been "customized" to perform better in its test format.
"Meta's interpretation of our policy did not match what we expect from model providers," LMArena said in a post on X.
A Meta spokesperson told Business Insider that "'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LMArena."
They added: "We have now released our open-source version and will see how developers customize Llama 4 for their own use cases."
The benchmark problem
The saga speaks to broader problems the AI industry has increasingly had with benchmarks.
Companies spending billions of dollars on AI development have a lot riding on releasing models more powerful than the last, which cognitive scientist and AI researcher Gary Marcus says can be problematic.
"Nowadays, with a lot of money riding on benchmark performance, it becomes very tempting for big tech companies to create training data that 'teaches to the test,' and then the benchmarks tend to lose even more validity," Marcus, who has criticized parts of the AI industry he considers overhyped, told BI.
There is also the question of whether benchmarks measure the right things.
In a February paper titled "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation," researchers from the European Commission's Joint Research Centre concluded that major problems exist in today's approach.
The researchers said there are "systemic flaws in current benchmarking practices," which are "fundamentally shaped by cultural, commercial and competitive dynamics that often prioritize state-of-the-art performance at the expense of broader societal concerns."
Similarly, Dean Valentine, cofounder and CEO of the AI security startup ZeroPath, wrote in a March blog post that "recent AI model progress feels mostly like bullshit."
In the post, Valentine said he and his team had been evaluating the performance of various models claiming "some sort of improvement" since the release of Anthropic's Claude 3.5 Sonnet in June 2024.
None of the new models his team tried made a "significant difference" on his company's internal benchmarks or in developers' ability to find new bugs, he said. They may have been "more fun to talk to," he added, but the gains did not reflect "economic usefulness or generality."
As he put it, if the industry can't figure out how to measure even the intellectual ability of models now, while they are mostly confined to chatrooms, it is hard to see how AI could be measured accurately in the future.
Benchmarks can be a “good compass”
Nathan Habib, a machine learning engineer at Hugging Face, told BI that the problem with many arena-style benchmarks is that they skew toward human preference through crowdsourced votes, which means "you can optimize your model for likability rather than capability."
"For benchmarks to really serve the community, we need several safeguards: up-to-date data, reproducible results, neutral third-party evaluations, and protection against answer contamination," said Habib, pointing to the GAIA benchmark as an example of a tool that does this.
He added that even though benchmarks aren't perfect, "they are still good compasses for where we should be heading."
According to Marcus, there is no immediate solution. "Making really good tests is hard, and keeping people from gaming those tests can be even harder," he told BI.
He said many tests try to measure "language understanding," but "it turns out you can get through a lot of these tests by memorizing a lot of things, without having any deep understanding of language at all."
Marcus added: "The immediate risk is that customers are told new systems are better and spend a lot of money on that premise."
So how should someone navigate the sprawling world of AI models? How can you tell what's better about DeepSeek-R1, DeepSeek-V3, Claude 3.5 Haiku, or Claude 3.7 Sonnet?
"When it comes to selecting the right model amid state-of-the-art claims, don't forget that the best model is not the one that wins every benchmark; it's the one that solves your specific problem elegantly," Clémentine Fourrier, an AI researcher at Hugging Face, told BI.
"Don't chase the model with the highest score; chase the model that scores highest on what matters to you," she said.