The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.
So far, the new test, called ARC-AGI-2, has stumped most models.
"Reasoning" AI models such as OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on the Arc Prize leaderboard for ARC-AGI-2. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.
The ARC-AGI tests consist of puzzle-like problems in which an AI has to identify visual patterns from a grid of different-colored squares and generate the correct "answer" grid. The problems are designed to force an AI to adapt to new problems it hasn't seen before.
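To make the format concrete, here is a minimal sketch in Python based on the publicly documented ARC-AGI-1 task format (a JSON object with "train" demonstration pairs and "test" inputs, where each grid is a list of rows of integers 0 through 9 denoting colors); the toy task and solver below are our own illustration, not an actual benchmark task:

```python
import json

# A toy ARC-style task in the publicly documented JSON structure.
# Each grid is a list of rows; each cell is an integer 0-9 denoting a color.
task = {
    "train": [  # demonstration pairs the solver can study
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [  # the solver must produce the "output" grid for this input
        {"input": [[3, 0], [0, 3]]},
    ],
}

def solve(grid):
    # Hypothetical solver for this toy task only: the pattern shown in the
    # demonstration pairs is a horizontal flip of each row.
    return [list(reversed(row)) for row in grid]

print(json.dumps(solve(task["test"][0]["input"])))  # [[0, 3], [3, 0]]
```

The point of the benchmark is that each task encodes a different, previously unseen transformation, so a solver cannot hard-code a rule like the one above; it has to infer the pattern from the demonstration pairs alone.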
The Arc Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, "panels" of these people got 60% of the test questions right, far better than any of the models' scores.

In a post on X, Chollet claimed that ARC-AGI-2 is a better measure of an AI model's actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation's tests aim to assess whether an AI system can efficiently acquire new skills outside the data it was trained on.
Unlike ARC-AGI-1, the new test prevents AI models from relying on "brute force" (extensive computing power) to find solutions, Chollet said. Chollet had previously acknowledged this as a major flaw of ARC-AGI-1.
To address the first test's flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.
"Intelligence is not solely defined by the ability to solve problems or achieve high scores," Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. "The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, 'Can AI acquire [the] skill to solve a task?' but also, 'At what efficiency or cost?'"
ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3's performance gains on ARC-AGI-1 came with a hefty price tag.
The version of OpenAI's o3 model (o3 low) that was first to reach new heights on ARC-AGI-1, scoring 75.7% on that test, managed only 4% on ARC-AGI-2 while using $200 worth of computing power per task.

ARC-AGI-2's arrival comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.
Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.
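For a sense of the efficiency gap the foundation is pointing at, here is a back-of-the-envelope sketch in Python using only the figures quoted above; the cost-per-correctly-solved-task framing is our own illustration, not the foundation's official metric:

```python
# Back-of-the-envelope comparison using the figures quoted in this article.
# "Cost per correctly solved task" is an illustrative framing, not the
# Arc Prize Foundation's official efficiency metric.

o3_low = {"score": 0.04, "cost_per_task": 200.00}  # o3 (low) on ARC-AGI-2
target = {"score": 0.85, "cost_per_task": 0.42}    # Arc Prize 2025 target

for name, run in [("o3 (low)", o3_low), ("prize target", target)]:
    cost_per_solve = run["cost_per_task"] / run["score"]
    print(f"{name}: ${cost_per_solve:,.2f} per correctly solved task")

# o3 (low): $5,000.00 per correctly solved task
# prize target: $0.49 per correctly solved task
```

By that rough measure, the contest target asks for roughly a ten-thousand-fold improvement in cost efficiency over o3 (low)'s current showing on the benchmark.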