Have been taking different local open-weight LLMs for a test drive in different harnesses (Qwen-Code, Codex, Claude Code).
30B Mixture-of-Experts models are a nice sweet spot and can solve challenging problems. They get roughly 40 tok/sec on a Mac or DGX Spark, which is similar to GPT 5.5 in a Pro subscription and totally usable for everyday work.
Even more interesting is the harness choice! Claude Code seems to be using 2x as many tokens as Codex.
Gemma 4 E2B is included just for reference, to show that the tasks can't be trivially solved by smaller models.
Just finishing a longer write-up about this and will share soon (likely tomorrow)!
Jun 26 at 2:42 PM