Everyone's talking about how Chinese AI models have caught up on quality. DeepSeek, Qwen, and others have genuinely closed the gap. But our partner Weijin Research (please subscribe!) makes a compelling case that the competition has quietly shifted to a completely different battlefield: inference speed.
The numbers are striking. Chinese open-source models run at roughly 100 tokens per second, priced from free to $3 per million tokens. US closed models are doing 400 to 1,000+ tokens per second at $45-150 per million. That's not a small gap. It's a different economic tier entirely. And the reason isn't algorithms, it's hardware. Groq's LPU, Cerebras, and Microsoft's Maia 200 are all converging on SRAM-heavy chip architectures purpose-built for fast inference, and China has no domestic equivalent.
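To make that concrete, here's a back-of-envelope sketch in Python. The throughput and price figures are the ones quoted above; the 20,000-token response size is an illustrative assumption standing in for one long coding-agent turn.

```python
# Back-of-envelope: what the quoted speed/price gap means for a single
# long agent response. Throughput and price figures are from the post;
# the response size is an illustrative assumption.
TOKENS_PER_RESPONSE = 20_000  # hypothetical long coding-agent turn

tiers = {
    # name: (tokens per second, USD per million output tokens)
    "Chinese open-source (top of range)": (100, 3.00),
    "US closed, fast tier (low end)": (400, 45.00),
    "US closed, fast tier (high end)": (1_000, 150.00),
}

for name, (tps, usd_per_million) in tiers.items():
    seconds = TOKENS_PER_RESPONSE / tps
    cost = TOKENS_PER_RESPONSE / 1_000_000 * usd_per_million
    print(f"{name}: {seconds:.0f} s per response, ${cost:.2f}")

# Chinese open-source (top of range): 200 s per response, $0.06
# US closed, fast tier (low end): 50 s per response, $0.90
# US closed, fast tier (high end): 20 s per response, $3.00
```

The point of the arithmetic: at 200 seconds per turn an interactive agent feels broken no matter how cheap the tokens are, while at 20 seconds it's merely slow. Low price doesn't buy back the latency.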
This matters because speed isn't just a nice-to-have. It unlocks entirely different categories of applications (real-time coding agents, interactive reasoning) that command premium pricing. Without access to these chips, Chinese providers are structurally locked out of the highest-value segment of the AI market, even if their models are just as smart.
You’ll find the link in the comments.