Most people are aware of KL regularization for RL, but early work (e.g., the PPO paper) also used an entropy bonus for regularization.

From an information theory perspective, entropy measures the uncertainty over the possible outcomes of a random variable:

- High entropy: probability mass is spread across many outcomes.

- Low entropy: probability mass is concentrated on a few outcomes.

In the LLM domain, we can measure the entropy of a model’s token distribution: low entropy means that the LLM places most of its probability mass on a small set of tokens, while high entropy means the mass is spread across many tokens. Specifically, we can compute entropy using the equation shown in the image.
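For reference, the standard Shannon entropy of the next-token distribution can be written as follows. The notation is an assumption and not taken from the post: π_θ is the policy (the LLM), x is the prompt, y_{<t} are the tokens generated before step t, and V is the vocabulary.

```latex
H_t \;=\; H\big(\pi_\theta(\cdot \mid x, y_{<t})\big)
    \;=\; -\sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \,\log \pi_\theta(v \mid x, y_{<t})
```

As a quick check with the natural log: a uniform distribution over 4 tokens gives H = log 4 ≈ 1.386 nats, while a distribution that puts all of its mass on one token gives H = 0.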

Usually, entropy is computed for each token (i.e., at each decoding step) and then averaged across the generated trajectory. See code in the image.
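A minimal pure-Python sketch of this per-token computation and averaging, assuming we have the raw logits at every decoding step. The function names and the toy 4-token vocabulary are hypothetical; real training code would typically do the same thing on batched tensors.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at one decoding step."""
    probs = softmax(logits)
    # Terms with p == 0 contribute 0 by convention (p * log p -> 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_trajectory_entropy(step_logits):
    """Average the per-token entropy over all decoding steps of one generated sequence."""
    entropies = [token_entropy(logits) for logits in step_logits]
    return sum(entropies) / len(entropies)

# Toy trajectory (assumed data): 4-token vocabulary, two decoding steps.
step_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform distribution: maximum entropy, log(4)
    [20.0, 0.0, 0.0, 0.0],  # sharply peaked distribution: entropy close to 0
]
avg_entropy = mean_trajectory_entropy(step_logits)
```

Here the uniform step contributes log(4) nats and the peaked step contributes almost nothing, so the trajectory average lands at roughly half of log(4).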

After computing the entropy, we can turn it into an entropy bonus and use it as a regularization term by scaling it with a coefficient β and adding it to either the reward or the objective function (the original PPO paper adds it to the objective). This is essentially the same way we add the KL divergence to the RL objective.
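The two placements can be sketched as follows. This is an illustration under assumptions, not any specific paper's implementation: the function names are hypothetical, the inputs are plain scalars, and the default β = 0.01 is only a placeholder value.

```python
def loss_with_entropy_bonus(policy_loss, mean_entropy, beta=0.01):
    """Entropy bonus in the objective: the loss is minimized, so we subtract
    beta * entropy, which means higher entropy lowers the loss."""
    return policy_loss - beta * mean_entropy

def reward_with_entropy_bonus(reward, mean_entropy, beta=0.01):
    """Entropy bonus in the reward: the reward is maximized, so we add
    beta * entropy to it."""
    return reward + beta * mean_entropy

# Example with assumed numbers: a policy loss of 1.0 and a mean entropy of 0.5 nats.
regularized_loss = loss_with_entropy_bonus(1.0, 0.5, beta=0.1)
shaped_reward = reward_with_entropy_bonus(1.0, 0.5, beta=0.1)
```

Note the sign convention: the bonus is subtracted from a loss and added to a reward, so in both cases the optimizer is nudged toward higher-entropy token distributions.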

The purpose of the entropy bonus is to prevent the LLM from becoming overly confident in its token distribution and, in turn, avoid any premature entropy collapse.

As with the KL divergence, entropy bonuses are now most commonly incorporated into the loss function. Recent work on RL uses less regularization (e.g., dropping the KL divergence term entirely is now a common approach). But there are still recent RL + reasoning papers that use an entropy bonus! For example:

Apr 25 at 5:01 PM