Cameron R. Wolfe, Ph.D. (@cwolferesearch)

Most people are aware of KL regularization for RL, but initial work (e.g., PPO paper) also used an entropy bonus for regularization purposes.

From an information theory perspective, entropy captures the level of uncertainty associated with the possible states for a variable:

- High entropy: probability mass is spread across many outcomes.

- Low entropy: probability mass is concentrated on a few outcomes.

In the LLM domain, we can measure the entropy of a model’s token distribution. Low entropy means that the LLM places most of its probability mass on a small set of tokens, while high entropy means the mass is spread across many tokens. Specifically, we can compute entropy using the equation shown in the image.
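The image is not reproduced here. It presumably shows the standard Shannon entropy of the next-token distribution, which can be written as below. The symbols (vocabulary V, policy π_θ, prompt x, generated tokens y) are notation assumed for this sketch, not taken from the post:

```latex
H(p_t) = -\sum_{v \in \mathcal{V}} p_t(v) \log p_t(v),
\qquad p_t(v) = \pi_\theta(v \mid x, y_{<t})

% Worked example with a 4-token vocabulary (natural log, so units are nats):
% uniform:  p_t = (0.25, 0.25, 0.25, 0.25)  =>  H = \log 4 \approx 1.386
% peaked:   p_t = (0.97, 0.01, 0.01, 0.01)  =>  H \approx 0.168
```

The uniform case is the maximum possible entropy for four outcomes, and the peaked case shows how quickly entropy drops once one token dominates.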

Usually, entropy is computed for each token (i.e., at each decoding step) and then averaged across the generated trajectory. See code in the image.
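Since the code image is not available, here is a minimal sketch of the same idea in plain Python: compute the entropy at each decoding step, then average over the trajectory. The function names and toy logits are made up for illustration. In practice this would be done on batched logit tensors in a framework like PyTorch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one decoding step's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at one step."""
    probs = softmax(logits)
    # Skip zero-probability tokens, since p * log(p) -> 0 as p -> 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_trajectory_entropy(step_logits):
    """Average per-token entropy over all decoding steps of one sequence."""
    entropies = [token_entropy(logits) for logits in step_logits]
    return sum(entropies) / len(entropies)

# Toy trajectory: 3 decoding steps over a 4-token vocabulary (made-up logits).
step_logits = [
    [0.0, 0.0, 0.0, 0.0],    # uniform: maximum entropy, log(4)
    [10.0, 0.0, 0.0, 0.0],   # sharply peaked: entropy close to zero
    [2.0, 1.0, 0.0, -1.0],   # somewhere in between
]
print(mean_trajectory_entropy(step_logits))
```

For a batch of trajectories, padding tokens would also need to be masked out before averaging. That step is omitted here to keep the sketch short.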

After computing the entropy, we can turn it into an entropy bonus and use it as a regularization term by simply scaling it with a coefficient β and incorporating it into either the reward (this is done in the original PPO paper) or the objective function. This is basically the same way we add the KL divergence into the RL objective.
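To make the two options concrete, here is a rough sketch of where the scaled entropy term goes and with which sign. The function names and the default β = 0.01 are hypothetical, chosen only for illustration. The policy loss, reward, and mean entropy are assumed to be computed elsewhere.

```python
def reward_with_entropy_bonus(reward, mean_entropy, beta=0.01):
    """Option 1: add the bonus to the reward, which is maximized.

    Higher entropy gives a higher shaped reward.
    beta=0.01 is an illustrative value, not a recommendation.
    """
    return reward + beta * mean_entropy

def loss_with_entropy_bonus(policy_loss, mean_entropy, beta=0.01):
    """Option 2: add the bonus to the objective.

    Written here as a loss to minimize, so the entropy term is subtracted:
    higher entropy lowers the loss.
    """
    return policy_loss - beta * mean_entropy

# Toy numbers (made up): the bonus nudges both quantities toward
# keeping the token distribution spread out.
shaped_reward = reward_with_entropy_bonus(reward=1.0, mean_entropy=1.2)
total_loss = loss_with_entropy_bonus(policy_loss=0.5, mean_entropy=1.2)
print(shaped_reward, total_loss)
```

The sign is the easy thing to get wrong. The bonus should reward higher entropy, so it is added when maximizing and subtracted when minimizing.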

The purpose of the entropy bonus is to prevent the LLM from becoming overly confident in its token distribution and, in turn, avoid any premature entropy collapse.

As with the KL divergence, entropy bonuses are now more commonly incorporated into the loss function. Recent work on RL uses less regularization (e.g., dropping the KL divergence term entirely is now a common approach). But you will still see recent RL + reasoning papers that use an entropy bonus! For example:

arxiv.org

IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scal…

Apr 25

5:01 PM
