It's been a while since I did an LLM architecture post. I just stumbled upon the Arcee AI Trinity Large release and technical report from yesterday and couldn't resist:
- 400B param MoE (13B active params)
- Base model performance similar to GLM 4.5 base
- Alternating local (sliding-window) and global attention layers in a 3:1 ratio, like Olmo 3
- QK-Norm (popular since Olmo 2) and NoPE (as in SmolLM3)
- Gated attention like Qwen3-Next
- Sandwich RMSNorm (kind of like Gemma 3 but depth-scaled)
- DeepSeek-like MoE with lots of small experts, but coarser-grained, which helps with inference throughput (something we also saw in Mistral 3 Large when it adopted the DeepSeek V3 architecture); a minimal sketch of a few of these pieces follows below
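Here is a minimal PyTorch sketch of some of the attention-side ingredients (QK-Norm, an output-gated attention block, sandwich RMSNorm, and the 3:1 local:global layer pattern). All dimensions and names are hypothetical placeholders for illustration; this is not the Trinity implementation, just the general idea.

```python
# Illustrative sketch only; hypothetical dims, not the Trinity Large code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedAttention(nn.Module):
    """Multi-head attention with QK-Norm and a sigmoid output gate."""
    def __init__(self, d_model: int, n_heads: int, window: int | None = None):
        super().__init__()
        self.n_heads, self.d_head, self.window = n_heads, d_model // n_heads, window
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)  # gated attention (Qwen3-Next style)
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.RMSNorm(self.d_head)                 # QK-Norm on the head dimension
        self.k_norm = nn.RMSNorm(self.d_head)

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)
        # Causal mask; "local" layers additionally restrict attention to a sliding window.
        mask = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
        if self.window is not None:
            mask &= torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device),
                               diagonal=-(self.window - 1))
        y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        y = y.transpose(1, 2).reshape(b, t, -1)
        return self.out(torch.sigmoid(self.gate(x)) * y)      # sigmoid output gate


class Block(nn.Module):
    """Transformer block with sandwich RMSNorm: norms before AND after each sublayer."""
    def __init__(self, d_model: int, n_heads: int, window: int | None):
        super().__init__()
        self.attn = GatedAttention(d_model, n_heads, window)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model, bias=False), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model, bias=False))
        self.norms = nn.ModuleList(nn.RMSNorm(d_model) for _ in range(4))

    def forward(self, x):
        x = x + self.norms[1](self.attn(self.norms[0](x)))
        return x + self.norms[3](self.mlp(self.norms[2](x)))


# 3:1 local:global pattern: three sliding-window layers, then one full-attention layer.
layers = nn.ModuleList(
    Block(d_model=256, n_heads=4, window=None if (i + 1) % 4 == 0 else 128)
    for i in range(8)
)
x = torch.randn(2, 64, 256)
for layer in layers:
    x = layer(x)
print(x.shape)  # torch.Size([2, 64, 256])
```

The toy config (256-dim, 8 layers, window of 128) is just to keep the example runnable; the point is the layer pattern and where the norms and gate sit, not the sizes.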
Added a slightly longer write-up to my The Big Architecture Comparison article: magazine.sebastianrasch…
Jan 29 at 3:56 PM