Nate (@natesnewsletter)

Jul 14, 2025

The Grok team seems to be overfitting their models to make them look good on evaluations. I tested Grok 4 vs. o3 vs. Opus 4 and formed a solid sense of where the new Grok actually ranks (spoiler: it’s not #1).

Come for the discovery that Grok is statistically more likely to snitch than any other model, stay for the discussion of what’s really causing Grok’s personality issues with Elon.

Cheers!

Grok 4 is "#1" But Real-World Users Ranked it #66—Here's the Gap

Grok didn’t tell the truth to the world, again.

Nate’s Substack

Jul 14

7:19 PM
