Google has published a paper that might mark the beginning of the end of the Transformer-only era.
Transformers keep a token-level memory (the KV cache) that grows with the sequence, which gives strong recall but makes long contexts expensive.
RNN-style models are efficient, but they compress the entire past into one fixed-size state.
Memory Caching creates a middle ground.
It lets recurrent models save compressed memory checkpoints across sequence segments. Later tokens can then retrieve from both the current memory and the older cached memories.
So instead of:
one fixed memory state
we get:
many compressed memory checkpoints + query-dependent retrieval
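To make the idea concrete, here is a minimal Python sketch. It is not the paper's actual formulation. It assumes a linear-attention-style matrix memory, a memory that resets at every segment boundary, and a softmax over the similarity between the query and each segment's mean key as the retrieval weighting. All function names are hypothetical.

```python
import math

def write(M, k, v):
    """Linear-attention-style write, in place: M += v k^T."""
    for i in range(len(v)):
        for j in range(len(k)):
            M[i][j] += v[i] * k[j]

def read(M, q):
    """Read from one memory matrix: returns M q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in M]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def run_with_memory_caching(keys, values, queries, segment_len):
    """Process a sequence token by token, caching one memory per segment.

    Assumption (illustrative only): each cached memory is summarized by the
    mean key of its segment, and retrieval weights are a softmax over
    dot(query, summary).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    M = [[0.0] * d_k for _ in range(d_v)]  # current fixed-size memory
    key_sum, count = [0.0] * d_k, 0        # running summary of this segment
    cache = []                             # list of (summary_key, frozen memory)
    outputs = []

    for t, (k, v, q) in enumerate(zip(keys, values, queries)):
        # 1. Write the new token into the current memory.
        write(M, k, v)
        key_sum = [a + b for a, b in zip(key_sum, k)]
        count += 1

        # 2. Query-dependent retrieval over cached memories + current memory.
        current_summary = [s / count for s in key_sum]
        candidates = cache + [(current_summary, M)]
        weights = softmax([dot(q, summary) for summary, _ in candidates])
        out = [0.0] * d_v
        for w, (_, mem) in zip(weights, candidates):
            out = [o + w * r for o, r in zip(out, read(mem, q))]
        outputs.append(out)

        # 3. At a segment boundary, checkpoint the memory and start a new one.
        if (t + 1) % segment_len == 0:
            cache.append((current_summary, M))
            M = [[0.0] * d_k for _ in range(d_v)]
            key_sum, count = [0.0] * d_k, 0

    return outputs, cache
```

In this sketch the cache grows by one fixed-size memory per segment rather than one entry per token, which is the middle ground described above: a query that matches an old segment can still reach that segment's checkpoint even after the current memory has moved on.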
The paper explores several variants: Residual Memory, Gated Residual Memory, Memory Soup, and Sparse Selective Caching.
This does not mean Transformers are dead.
It means full attention may no longer be the only serious path to growing memory.
I wrote a detailed blog post explaining the paper, the history behind it, the method, the experiments, and what this could mean for future long-context architectures.
Read it here: