Devansh (@chocolatemilkcultleader):

Was studying Google's Gemma 4 and had an important question: Why Does Gemma 4 Interleave Local and Global Attention? Turns out, this is a very important design decision.

Attention is the most expensive thing a transformer does. Every token has to look at every other token, meaning the compute cost scales quadratically with sequence length. At 128K tokens, that is roughly 16 billion score computations per layer. Multiplied across 30 to 60 layers, it eats the FLOP budget alive.

The historical move is to stop looking at everything. Sliding window attention (Mistral, Phi) caps each token’s view at a fixed window, say the 512 most recent tokens. The cost drops from O(n²) to O(n × w) for a window of size w, which at 128K context and w = 512 is a 250x reduction.
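A quick back-of-the-envelope check of those numbers. This is a sketch that counts one score per query-key pair for a single head, ignores causal masking, and treats 128K as 128,000; the function name is made up for illustration.

```python
def score_counts(n_tokens: int, window: int) -> tuple[int, int, float]:
    """Count query-key scores for one attention layer (one head)."""
    full = n_tokens * n_tokens    # every token scores every token: O(n^2)
    local = n_tokens * window     # every token scores at most `window` tokens: O(n * w)
    return full, local, full / local


full, local, ratio = score_counts(128_000, 512)
print(f"full: {full:,}  sliding window: {local:,}  reduction: {ratio:.0f}x")
# full: 16,384,000,000  sliding window: 65,536,000  reduction: 250x
```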

The wall you hit here is signal degradation. With a strict sliding window, a token at position 50,000 cannot directly see a token at position 1,000. Long-range dependencies have to hop through intermediate layers, one window at a time, and each hop degrades the signal. Most modern small models just accept this range limitation, operating on the assumption that if you’re using a 2B phone model to process legal documents (or any solution not named irys.ai for legal work for that matter), then you deserve to be arrested and have your contributions to the gene pool snipped.
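To put a number on the hopping: if each sliding-window layer can carry information at most one window further, the minimum number of layers needed to bridge a gap is the distance divided by the window, rounded up. A minimal sketch under that assumption (`min_hops` is a hypothetical helper, not something from any library):

```python
import math


def min_hops(distance: int, window: int) -> int:
    """Fewest sliding-window layers needed to carry information `distance` tokens."""
    return math.ceil(distance / window)


hops = min_hops(50_000 - 1_000, 512)
print(hops)  # 96
```

That is more hops than the 30 to 60 layers mentioned above, so with a pure 512-token window the two positions never connect at all.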

Gemma 4 cannot make that assumption. E2B and E4B are multimodal, and processing video frames blows past 8K tokens in seconds. The edge models must handle long contexts. Google’s fix for this conundrum is interleaving. Most layers use local sliding-window attention, while a few execute full global attention. The model alternates between them at a fixed ratio:

E2B: 4 local + 1 global, repeated 7 times. Window: 512.

E4B: 5 local + 1 global, repeated 7 times. Window: 512.

26B: 5 local + 1 global, repeated 5 times. Window: 1024.

31B: 5 local + 1 global, repeated 10 times. Window: 1024.

Every model ends on a global layer. The output always sees the full context regardless of what the intermediate layers did.
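The patterns above are easy to sanity-check in a few lines. This is a sketch built only from the numbers listed in this post: the dictionary, function names, and the "one window per local layer" reach model are my own illustration, not Gemma's actual config format.

```python
# (local layers per block, number of blocks, window size), as listed above.
PATTERNS = {
    "E2B": (4, 7, 512),
    "E4B": (5, 7, 512),
    "26B": (5, 5, 1024),
    "31B": (5, 10, 1024),
}


def build_schedule(n_local: int, n_blocks: int) -> list[str]:
    """Repeat [local x n_local, global] n_blocks times."""
    return (["local"] * n_local + ["global"]) * n_blocks


def first_layer_reaching(schedule: list[str], window: int, distance: int):
    """1-based index of the first layer where information from `distance`
    tokens back can arrive, or None if the stack never gets there.
    Assumes each local layer extends reach by one window and a global
    layer sees everything."""
    reach = 0
    for i, kind in enumerate(schedule, start=1):
        if kind == "global":
            return i
        reach += window
        if reach >= distance:
            return i
    return None


for name, (n_local, n_blocks, window) in PATTERNS.items():
    schedule = build_schedule(n_local, n_blocks)
    layer = first_layer_reaching(schedule, window, 49_000)
    print(name, len(schedule), "layers, ends on", schedule[-1],
          "| 49,000-token gap bridged at layer", layer)

# Same depth as E2B but with no global layers: the gap is never bridged.
print(first_layer_reaching(["local"] * 35, 512, 49_000))  # None
```

Under these assumptions every variant bridges a 49,000-token gap at its first global layer (layer 5 or 6), while an all-local stack of the same depth never does.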

The insight here is that the global layer is not just doing the same work less frequently. Local layers build up rich feature representations within short spans — 512 tokens is plenty for syntax and local semantics. The occasional global layer then executes long-range integration on those already-refined features, rather than raw token signals. It does less work per unit of capacity, which is why the 5:1 ratio sustains long-range reasoning without degrading the output.

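At E2B’s 4:1 ratio, 80% of the attention layers pay linear compute instead of quadratic. On an 8K query with a 512-token window, that works out to roughly a 4x reduction in attention score compute on the phone; the 5x ceiling is approached only as the context grows much longer than the window. At the 31B’s 256K context, the savings are the only reason the model fits in its FLOP budget at all.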
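Here is the arithmetic behind that kind of estimate, as a rough sketch. It counts query-key scores only (no causal masking, no projections, no MLP), treats 8K as 8,192 and 256K as 262,144, and the function name is hypothetical.

```python
def interleaved_speedup(n_tokens: int, window: int, n_local: int, n_global: int = 1) -> float:
    """Score-count ratio: all-global stack vs. an interleaved local/global block."""
    full_layer = n_tokens * n_tokens
    local_layer = n_tokens * min(window, n_tokens)  # a window can't exceed the context
    all_global = (n_local + n_global) * full_layer
    interleaved = n_local * local_layer + n_global * full_layer
    return all_global / interleaved


# E2B-style block: 4 local + 1 global, window 512, 8K context.
print(interleaved_speedup(8_192, 512, n_local=4))  # 4.0

# 31B-style block: 5 local + 1 global, window 1024, 256K context.
print(round(interleaved_speedup(262_144, 1_024, n_local=5), 2))
```

The speedup is capped at (local + global) / global, so 5x for a 4:1 block and 6x for a 5:1 block. Short contexts land below the cap because the local layers are not free; at 256K the 5:1 block sits just under its 6x ceiling.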

Every modern long-context architecture is converging on the same bet: uniform O(n²) attention is a cost most tokens don’t need to pay. How this space evolves is worth tracking for every serious AI researcher.
