Youssef Hosni (@youssefhosni95):

GPT 5.5 underperforms Opus 4.7 on SWE-Bench Pro. I couldn't find any reported SWE-Bench scores at all; an internal benchmark is reported instead.

That footnote is trying really hard to bury the lede. GPT 5.5 isn't SOTA for coding.

The footnote: “*Anthropic reported signs of memorization on a subset of problems”

Apr 23, 9:29 PM