OpenAI's recently launched o3 and o4-mini AI models are state-of-the-art in many respects. However, the new models still hallucinate, or make things up; in fact, they hallucinate more than several of OpenAI's older models.
Hallucinations have proved to be one of the biggest and most difficult problems to solve in AI, affecting even today's best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn't appear to be the case for o3 and o4-mini.
According to OpenAI's internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more frequently than the company's previous reasoning models (o1, o1-mini, and o3-mini) as well as its traditional, "non-reasoning" models, such as GPT-4o.
Perhaps more concerning, the ChatGPT maker doesn't really know why it's happening.
In its technical report for o3 and o4-mini, OpenAI writes that "more research is needed" to understand why hallucinations are getting worse as it scales up reasoning models. o3 and o4-mini perform better in some areas, including coding and math. But because they "make more claims overall," they often end up making more accurate claims as well as more inaccurate, hallucinated ones.
OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company's in-house benchmark for measuring the accuracy of a model's knowledge about people. That's roughly double the hallucination rate of OpenAI's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. o4-mini did even worse on PersonQA, hallucinating 48% of the time.
Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 tends to fabricate the steps it took to arrive at answers. In one example, Transluce observed o3 claiming that it ran code on a MacBook Pro "outside of ChatGPT," then copied the numbers into its answer. While o3 has access to some tools, it can't do that.
"Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines."
Sarah Schwettmann, co-founder of Transluce, added that o3's hallucination rate may make it less useful than it otherwise would be.
Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in its coding workflows, and that they've found it to be a step above the competition. However, Katanforoosh says o3 tends to hallucinate broken website links. The model will supply a link that, when clicked, doesn't work.
Hallucinations may help models arrive at interesting ideas and be creative in their "thinking," but they also make some models a tough sell for businesses in markets where accuracy is paramount. For example, a law firm likely wouldn't be pleased with a model that inserts lots of factual errors into client contracts.
One promising approach to boosting model accuracy is giving models web search capabilities. OpenAI's GPT-4o with web search achieves 90% accuracy on SimpleQA, one of OpenAI's accuracy benchmarks. Potentially, search could improve reasoning models' hallucination rates as well, at least in cases where users are willing to expose their prompts to a third-party search provider.
If scaling up reasoning models does indeed continue to worsen hallucinations, it will make the hunt for a solution all the more urgent.
"Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability," OpenAI spokesperson Niko Felix told TechCrunch in an email.
In the past year, the broader AI industry has pivoted to focus on reasoning models after techniques for improving traditional AI models started showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet it seems reasoning may also lead to more hallucinating, presenting a challenge.