Paweł Huryn (@huryn)

The app for independent voices

Wild: GPT-5.2 just scored 52.9% on ARC-AGI-2.

The average human score is about 60%.

This isn’t a random metric. This test is supposed to be easy for humans and hard for AI. It’s designed to measure abstract reasoning, not pattern-matching.

Not long ago, GPT-4 was scoring between 0% and 10%.

How is that possible?

Will we hit 60% (human-level) by EOY 2025?

…

Some say the next metric (GDPval) is even more impressive. It measures real-world knowledge work across 44 occupations. GPT-5.2 matches or beats human professionals 70.9% of the time.

But I’m a bit skeptical about GDPval. ARC-AGI-2, by contrast, was designed to minimize simple pattern-matching.

…

Tonight is going to be a testing night!

Dec 11, 10:11 PM
