OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the company's transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of the questions on FrontierMath, a challenging set of math problems. That score blew away the competition: the next-best model managed to answer only about 2% of FrontierMath problems correctly.

"Today, all offerings out there have less than 2% [on FrontierMath]," Mark Chen, chief research officer at OpenAI, said during a livestream. "We're seeing [internally], with o3 in aggressive test-time compute settings, we're able to get over 25%."

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly launched.

Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI's highest claimed score.

That doesn't mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI's, and that it used an updated release of FrontierMath for its evaluations.

"The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 290 problems in frontiermath-2024-11-26 versus the newer frontiermath-2025-02-28-private set)," wrote Epoch.

According to a post on X from the Arc Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model "is a different model […] tuned for chat/product use," corroborating Epoch's report.

"All released o3 compute tiers are smaller than the version we [benchmarked]," Arc Prize wrote. Generally speaking, larger compute tiers can be expected to achieve better benchmark scores.

Granted, the fact that the public release of o3 falls short of OpenAI's testing claims is somewhat of a moot point, since the company's o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut o3-pro, a more powerful variant of o3, in the coming weeks.

Still, it's another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking "controversies" are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren't informed of OpenAI's involvement until it was made public.

More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.
