This isn’t a random metric. This test is supposed to be easy for humans and hard for AI. It’s designed to measure abstract reasoning, not pattern-matching.
Not long ago, GPT-4 was scoring between 0% and 10%.
How is that possible?
Will we hit 60% (human-level) by EOY 2025?
…
Some say the next metric (GDPval) is even more impressive. It measures real-world knowledge work across 44 jobs. GPT-5.2 matches or beats humans 70.9% of the time.
But I’m a bit skeptical about this one. Unlike GDPval, ARC-AGI-2 was designed to minimize simple pattern-matching.
…
Tonight is going to be a testing night!
Dec 11 at 10:11 PM