The Grok team seems to be overfitting their models to make them look good on evaluations. I tested Grok 4 vs. o3 vs. Opus 4 and formed a solid sense of where the new Grok actually ranks (spoiler: it’s not #1).

Come for the discovery that Grok is statistically more likely to snitch than any other model, stay for the discussion of what’s really causing Grok’s personality issues with Elon.

Cheers!

Jul 14 at 7:19 PM