Sebastian Raschka, PhD (@rasbt): https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison

From GPT to MoE: I reviewed & compared the main LLMs of 2025 in terms of their architectural design, from DeepSeek-V3 to Kimi 2 & Qwen3.

Multi-head Latent Attention, sliding window attention, new Post- & Pre-Norm placements, NoPE, shared-expert MoEs, and more...

The Big LLM Architecture Comparison

Jul 23

2:30 PM