Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on the crowdsourced benchmark LM Arena. The incident prompted LM Arena's maintainers to apologize, change their policies, and score the unmodified, vanilla Maverick.
It turns out, it's not very competitive.
The unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," was ranked below models including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro as of Friday. Many of these models are months old.
The release version of Llama 4 has been added to LM Arena after it was found that they cheated, but you probably didn't see it because you have to scroll down to 32nd place, which is where it ranks. pic.twitter.com/a0bxkdx4lx
– ρ:ɡeσn (@pigeon__s) April 11, 2025
Why the poor showing? Meta's experimental Maverick, Llama-4-Maverick-03-26-Experimental, was "optimized for conversationality," the company explained in a chart published last week. Those optimizations evidently played well to LM Arena, in which human raters compare the outputs of models and select the ones they prefer.
As we've written before, LM Arena has never been the most reliable measure of an AI model's performance, for various reasons. Still, tailoring a model to a benchmark, besides being misleading, makes it challenging for developers to predict exactly how well the model will perform in different contexts.
In a statement, a Meta spokesperson told TechCrunch that Meta experiments with "all types of custom variants."
"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LM Arena," the spokesperson said. "We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We're excited to see what they will build and look forward to their ongoing feedback."