The idea of training LLMs to manage their own KV cache is super interesting to me. The recent neural garbage collection (NGC) paper was a great read on this topic.

Reasoning models / agents obviously need long sequences to handle complex reasoning, long horizon tasks, tool calls, etc. However, the size of the KV cache increases linearly with the length of the sequence, creating a KV cache bottleneck.
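To make the linear growth concrete, here is a rough sketch of the usual size estimate for a standard transformer KV cache: two tensors (keys and values) per layer, each storing one `head_dim` vector per KV head per token. The function name and the model dimensions used below are illustrative assumptions, not numbers from the NGC paper.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # assumption: fp16 / bf16 storage
) -> int:
    """Estimate KV cache size for one sequence in a standard transformer."""
    # Factor of 2 covers keys and values; everything else is a per-token cost,
    # so the total scales linearly with seq_len.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=8, head_dim=128)
long_trace = kv_cache_bytes(100_000, n_layers=32, n_kv_heads=8, head_dim=128)
```

With these assumed dimensions, each token costs 128 KiB, so a 100k-token reasoning trace needs about 12.2 GiB of cache for a single sequence.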

Several heuristics have been proposed to address this; e.g., only keeping recent tokens, keeping tokens with high attention scores, etc. But these heuristics tend to degrade performance and may or may not work well depending on the domain / task.

Instead of using heuristics, we can try to teach the LLM to manage its own KV cache. Concretely, NGC does this by implementing an eviction cadence. Every δ tokens during the decoding process, NGC scores all of its KV cache blocks and evicts a fraction ϵ of them, so that only a (1 - ϵ) fraction of KV cache blocks is kept. Because the cache is pruned at regular intervals, the peak cache size stays roughly stable instead of growing with the sequence.
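A quick way to see why a fixed cadence keeps the peak cache size stable is to simulate it. The sketch below is a simplification under stated assumptions: it counts cached tokens rather than blocks, and it treats each eviction as keeping a (1 - ϵ) fraction of whatever is cached at that moment. The function name is made up for illustration. Under these assumptions the cache size c follows c ← (1 - ϵ)c + δ between evictions, so the peak can never exceed δ / ϵ.

```python
import math


def simulate_peak_cache(total_tokens: int, delta: int, eps: float) -> int:
    """Return the peak number of cached tokens over a decoding run.

    Assumption: every `delta` decoded tokens, a fraction `eps` of the
    cache is evicted (eps = 0.0 means no eviction at all).
    """
    cached = 0
    peak = 0
    for step in range(1, total_tokens + 1):
        cached += 1  # each decoded token adds one KV entry
        peak = max(peak, cached)  # peak is measured just before eviction
        if step % delta == 0:
            cached = math.floor((1.0 - eps) * cached)
    return peak


with_eviction = simulate_peak_cache(10_000, delta=128, eps=0.25)
without_eviction = simulate_peak_cache(10_000, delta=128, eps=0.0)
```

Here `with_eviction` stays below δ / ϵ = 512 tokens, while `without_eviction` grows to the full 10,000 tokens.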

To score KV cache blocks, NGC does not use any new or specialized models / modules. Instead, it repurposes the LLM's existing attention mechanism. The model takes the most recent query vectors, partitions the KV cache into fixed-size blocks, then scores the previous keys against those query vectors.
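The scoring step described above can be sketched roughly as follows, for a single attention head in plain Python. How per-key attention is aggregated into a block score is an assumption here (attention mass summed within a block, averaged over the recent queries), and so is the deterministic top-k selection; the paper's exact scoring and selection rule may differ. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def score_blocks(queries: List[Vector], keys: List[Vector], block_size: int) -> List[float]:
    """Score each fixed-size block of keys by the attention mass it receives."""
    dim = len(keys[0])
    num_blocks = math.ceil(len(keys) / block_size)
    scores = [0.0] * num_blocks
    for q in queries:
        # Standard scaled dot-product attention logits for one query.
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]  # numerically stable softmax
        norm = sum(exps)
        for pos, e in enumerate(exps):
            # Assumption: sum attention within a block, average over queries.
            scores[pos // block_size] += (e / norm) / len(queries)
    return scores


def blocks_to_keep(scores: List[float], eps: float) -> List[int]:
    """Return indices of the top (1 - eps) fraction of blocks, in cache order."""
    n_keep = math.ceil((1.0 - eps) * len(scores))
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:n_keep])


# Toy example: two blocks of two keys; the query aligns with block 0.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
queries = [[5.0, 0.0]]
scores = score_blocks(queries, keys, block_size=2)
kept = blocks_to_keep(scores, eps=0.5)
```

In the toy example the query points along the first two keys, so block 0 receives most of the attention mass and is the one block kept at ϵ = 0.5.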

Instead of performing specialized training for managing the KV cache, NGC simply incorporates KV cache management into the verifiable loss for training with RL / GRPO. The RL objective has two components:

1. A component for normal token predictions.

2. A component for KV cache eviction decisions.

This way, we can train the model end-to-end with RL to correctly evict KV cache blocks (similarly to predicting a token) while still using outcome rewards.
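To illustrate how a single outcome reward can drive both components, here is a minimal sketch of a GRPO-style surrogate loss. It is not NGC's actual objective: it assumes each eviction decision has a log-probability under the policy just like a token does, it omits the clipping and KL terms used in practice, and all names are hypothetical.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: normalize rewards within one group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std within the group
    return [(r - mean) / (std + 1e-6) for r in rewards]


def surrogate_loss(
    rewards: List[float],
    token_logps: List[List[float]],  # per rollout: log-probs of sampled tokens
    evict_logps: List[List[float]],  # per rollout: log-probs of eviction decisions (assumed)
) -> float:
    """Policy-gradient surrogate where both components share one advantage."""
    advantages = group_advantages(rewards)
    total = 0.0
    for adv, tok, ev in zip(advantages, token_logps, evict_logps):
        # The same outcome-based advantage weights token and eviction terms,
        # so good eviction decisions are reinforced like good tokens.
        total += -adv * (sum(tok) + sum(ev))
    return total / len(rewards)


# Two rollouts for one prompt: the first got the answer right, the second did not.
rewards = [1.0, 0.0]
loss = surrogate_loss(
    rewards,
    token_logps=[[-1.0], [-2.0]],
    evict_logps=[[-0.5], [-0.5]],
)
```

The point of the sketch is that no separate eviction label is needed: rollouts whose eviction decisions led to a correct final answer get a positive advantage, and those decisions are pushed up along with the tokens.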

Apr 24 at 12:43 AM