The Grok team seems to be overfitting their models to make them look good on evaluations. I tested Grok 4 vs. o3 vs. Opus 4 and formed a solid sense of where the new Grok actually ranks (spoiler: it’s not #1).
Come for the discovery that Grok is statistically more likely to snitch than any other model; stay for the discussion of what’s really causing Grok’s personality issues with Elon.