Meta's benchmarks for its new AI models are a bit misleading

One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But it seems that the version of Maverick that Meta deployed to LM Arena differs from the version that is widely available to developers.

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” A chart on the official Llama website, meanwhile, discloses that Meta's LM Arena testing was conducted using “Llama 4 Maverick optimized for conversationality.”

As we have written before, for various reasons, LM Arena has never been the most reliable measure of an AI model's performance. But AI companies have generally not customized or otherwise fine-tuned their models to score better on LM Arena, or at least have not admitted to doing so.

The problem with tailoring a model to a benchmark, withholding it, and then releasing a “vanilla” variant of the same model is that it makes it difficult for developers to predict exactly how well the model will perform in particular contexts. It is also misleading. Ideally, benchmarks, badly inadequate as they are, provide a snapshot of a model’s strengths and weaknesses.

Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use a lot of emojis and give incredibly long-winded answers.

Okay Llama 4 is def a little cooked lol, what is this yap city? pic.twitter.com/y3gvhbvz65

– Nathan Lambert (@Natolambert) April 6, 2025

For some reason, the Llama 4 model in Arena uses a lot more emojis

On together.ai, it seems better: pic.twitter.com/F74Dx4zt

– Tech Dev Notes (@techdevnotes) April 6, 2025

We have reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.
