Youssef Hosni (@youssefhosni95)

Google has published a paper that might mark the beginning of the end of the Transformer-only era.

Transformers have growing token-level memory, which gives strong recall but high long-context cost.

RNN-style models are efficient, but they compress the entire past into one fixed-size state.
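To make that trade-off concrete, here is a back-of-the-envelope sketch. The function names, the dimension of 64, and the assumption that the recurrent state is a single square matrix are illustrative choices, not figures from the paper.

```python
def kv_cache_floats(num_tokens: int, d_model: int) -> int:
    """Attention-style memory: one key and one value vector per past token."""
    return 2 * num_tokens * d_model


def recurrent_state_floats(d_model: int) -> int:
    """Recurrent-style memory: one d_model x d_model state (assumed shape),
    no matter how many tokens have been read."""
    return d_model * d_model


# The token-level memory grows linearly with context; the recurrent state does not.
for n in (1_000, 100_000):
    print(n, kv_cache_floats(n, 64), recurrent_state_floats(64))
```

The first count scales with the number of tokens, which is where the long-context cost comes from. The second stays constant, which is why everything seen so far has to be squeezed into the same fixed budget.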

Memory Caching creates a middle ground.

It lets recurrent models save compressed memory checkpoints across sequence segments. Later tokens can then retrieve information from both the current memory and the older cached memories.

So instead of:

one fixed memory state

we get:

many compressed memory checkpoints + query-dependent retrieval
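The idea can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not the paper's formulation: the memory is a simple linear-attention-style matrix (a sum of value-key outer products), each segment starts a fresh memory, each checkpoint is scored by the dot product between the query and the segment's mean key, and the class name `CachedMemory` and all of its methods are hypothetical.

```python
import math


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CachedMemory:
    """Toy recurrent memory that checkpoints itself once per segment (assumed design)."""

    def __init__(self, d_k, d_v, segment_len):
        self.d_k, self.d_v, self.segment_len = d_k, d_v, segment_len
        self.cache = []  # list of (memory matrix, mean key) checkpoints
        self._start_segment()

    def _start_segment(self):
        # Assumption: every segment gets a fresh fixed-size memory.
        self.memory = zeros(self.d_v, self.d_k)
        self.key_sum = [0.0] * self.d_k
        self.count = 0

    def _mean_key(self):
        return [s / max(self.count, 1) for s in self.key_sum]

    def write(self, key, value):
        # Compress the token into the fixed-size state: memory += value * key^T.
        for i, v in enumerate(value):
            for j, k in enumerate(key):
                self.memory[i][j] += v * k
        self.key_sum = [s + k for s, k in zip(self.key_sum, key)]
        self.count += 1
        if self.count == self.segment_len:
            # Segment boundary: cache the compressed state and start over.
            self.cache.append((self.memory, self._mean_key()))
            self._start_segment()

    def retrieve(self, query):
        # Query-dependent retrieval over the cached checkpoints and the current memory.
        candidates = self.cache + [(self.memory, self._mean_key())]
        weights = softmax([dot(query, summary) for _, summary in candidates])
        reads = [[dot(row, query) for row in mem] for mem, _ in candidates]
        return [sum(w * r[i] for w, r in zip(weights, reads)) for i in range(self.d_v)]


mem = CachedMemory(d_k=4, d_v=2, segment_len=2)
mem.write([1, 0, 0, 0], [1.0, 0.0])  # segment 1
mem.write([0, 1, 0, 0], [0.0, 1.0])  # segment 1 ends here and is cached
mem.write([0, 0, 1, 0], [2.0, 2.0])  # segment 2, still the current memory
print(len(mem.cache), mem.retrieve([1, 0, 0, 0]))
```

In this toy run the query matches a key written in the first segment, so the cached checkpoint gets the larger retrieval weight even though the current memory has moved on. Storage grows with the number of segments rather than the number of tokens, which is the middle ground described above.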

The paper explores several variants: Residual Memory, Gated Residual Memory, Memory Soup, and Sparse Selective Caching.

This does not mean Transformers are dead.

It means full attention may no longer be the only serious path to growing memory.

I wrote a detailed blog post explaining the paper, the history behind it, the method, the experiments, and what this could mean for future long-context architectures.

Read it here:

To Data & Beyond

Google Published a Paper That Might End the Transformer-Only LLM Era

Jun 21

9:36 AM
