Of course, "benchmarks != real-world performance", and benchmarks have many issues. But what an exciting week for coding LLMs.
We got the open-weight Qwen3-Next-Coder, and we just got the Codex 5.3 / Opus 4.6 double release.
Unfortunately, Anthropic didn’t share SWE-Bench Pro results, so here I put the two models side by side based on the available Terminus 2.0 numbers: