Cheatsheet: RL Loss Functions
PPO and GRPO loss functions annotated term-by-term — the clipped surrogate, GAE, value loss, entropy bonus, and group-normalised advantages — with links to the posts explaining each design choice.
The two dominant RL algorithms for LLM alignment share a common core — the PPO clipped surrogate — but differ in how they estimate advantages and what auxiliary losses they carry. This cheatsheet annotates both objectives term-by-term and highlights where they diverge.
PPO objective
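$$L^{PPO}(\theta) = \mathbb{E}_t\left[\, L^{CLIP}_t(\theta) - c_1\, L^{VF}_t(\theta) + c_2\, S[\pi_\theta](s_t) \,\right]$$
Three terms summed with coefficients $c_1$ and $c_2$.
$r_t(\theta)$: probability ratio
The ratio of the new policy's probability to the old policy's for action $a_t$. When $r_t(\theta) > 1$ the new policy assigns more probability to this token; when $r_t(\theta) < 1$, less. This is the importance-sampling weight that lets us evaluate the new policy using trajectories collected under the old one:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$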
See PerTokenRatioExplorer for a per-token breakdown.
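In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities. A minimal sketch, assuming per-token log-probs are already available as plain lists (the function name `token_ratios` and the numbers are illustrative, not taken from the post):

```python
import math

def token_ratios(new_logprobs, old_logprobs):
    """Per-token ratio r_t = exp(log pi_new(a_t|s_t) - log pi_old(a_t|s_t))."""
    return [math.exp(n - o) for n, o in zip(new_logprobs, old_logprobs)]

# Hypothetical two-token completion.
new_lp = [math.log(0.30), math.log(0.10)]  # log-probs under the new policy
old_lp = [math.log(0.20), math.log(0.20)]  # log-probs under the old policy

ratios = token_ratios(new_lp, old_lp)
print(ratios)  # first token is now more likely (ratio > 1), second less likely (ratio < 1)
```

Working in log space avoids underflow when token probabilities are very small.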
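$L^{CLIP}_t$: clipped surrogate
$$L^{CLIP}_t(\theta) = \min\left( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \right)$$
The $\min$ of the unclipped and clipped terms acts as a pessimistic bound:
- When $\hat{A}_t > 0$ (good action): if $r_t(\theta) > 1 + \epsilon$ the gradient is zeroed, so there is no further push once the action is already much more likely.
- When $\hat{A}_t < 0$ (bad action): if $r_t(\theta) < 1 - \epsilon$ the gradient is zeroed, so there is no further suppression once the action is already much less likely.
Why clip instead of a hard KL constraint? TRPO enforces $\mathbb{E}_t\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big) \right] \le \delta$ as a hard constraint, which requires second-order optimisation. PPO approximates the same trust region with a first-order clipping operation. See PPOLossDecomposition.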
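The two zero-gradient cases above can be checked numerically. The sketch below evaluates the per-token clipped term and a finite-difference slope with respect to the ratio; `clipped_surrogate` and `surrogate_slope` are hypothetical helper names, and `eps=0.2` is one of the typical clip values:

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-token PPO term (to be maximised): min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def surrogate_slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite difference of the term with respect to the ratio."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2 * h)

# Good action, ratio already above 1 + eps: slope is zero.
print(surrogate_slope(1.5, advantage=1.0))
# Bad action, ratio already below 1 - eps: slope is zero.
print(surrogate_slope(0.5, advantage=-1.0))
# Good action whose probability dropped: not clipped, so the gradient still flows.
print(surrogate_slope(0.5, advantage=1.0))
```

The third case shows the asymmetry of the pessimistic bound: clipping only removes the incentive to move further in the direction the advantage already favours.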
$\hat{A}_t$: advantage estimate (PPO)
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
How much better was action $a_t$ than average from state $s_t$? In practice $A^\pi$ is not available and is estimated via GAE.
Why subtract $V^\pi(s_t)$? Any baseline that depends only on the state, not on the action, can be subtracted without biasing the gradient. Subtracting $V^\pi(s_t)$ removes variance common to all actions from state $s_t$. See BaselineVarianceReduction.
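A short derivation of why an action-independent baseline leaves the policy gradient unbiased, written for a discrete action space. The symbol $b(s)$ is a generic baseline introduced here for illustration; the value function is the special case used by PPO:

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
&= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

The baseline term contributes nothing in expectation, so it can only change the variance of the gradient estimate, not its mean.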
GAE — Generalised Advantage Estimation
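$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = R_t + \gamma V(s_{t+1}) - V(s_t)$$
$\delta_t$ is the one-step TD error: immediate reward $R_t$ plus discounted next-state value minus the current value prediction. A positive value means the step exceeded expectations.
$\gamma$ (discount factor): down-weights future rewards. At $\gamma = 0$ only the immediate reward matters; at $\gamma = 1$ all future rewards are weighted equally. For LLMs with a single end-of-episode reward, $\gamma$ is typically 1 or close to it.
$\lambda$ (GAE interpolation): blends the one-step TD estimate ($\lambda = 0$, low variance, high bias) with the full Monte Carlo return ($\lambda = 1$, high variance, zero bias). Typical value: $\lambda = 0.95$. See GAELambdaSweep.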
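GAE is usually computed with a single backward pass over the trajectory. A minimal sketch, assuming `values` carries one extra bootstrap entry for the state after the last step (0 for a terminal state); the function name and the toy numbers are illustrative:

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.

    values must have len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical 3-token episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # critic predictions plus terminal bootstrap

td_only = gae_advantages(rewards, values, gamma=1.0, lam=0.0)      # equals the TD errors
monte_carlo = gae_advantages(rewards, values, gamma=1.0, lam=1.0)  # return minus value
print(td_only, monte_carlo)
```

With `lam=0.0` only the final token receives credit; with `lam=1.0` every token receives the full return minus its value prediction, which illustrates the two ends of the interpolation.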
$L^{VF}_t$: value function loss
$$L^{VF}_t(\theta) = \big( V_\theta(s_t) - V_t^{targ} \big)^2$$
Trains the critic to predict expected future rewards from each state. A well-trained critic gives accurate baselines, reducing advantage variance. The coefficient $c_1$ (typically 0.5–1.0) balances how aggressively the critic is updated relative to the policy.
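A minimal sketch of the critic loss as a mean squared error. The choice of target (here simply a list of return targets passed in by the caller) is an assumption; implementations differ on how targets are built and on whether the loss is halved or clipped:

```python
def value_loss(predicted_values, value_targets):
    """Mean squared error between critic predictions and their targets."""
    errors = [(v - g) ** 2 for v, g in zip(predicted_values, value_targets)]
    return sum(errors) / len(errors)

# Hypothetical predictions and targets for two states.
loss = value_loss([0.5, 0.5], [1.0, 0.0])
print(loss)
```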
$S[\pi_\theta]$: entropy bonus
$$S[\pi_\theta](s_t) = -\sum_a \pi_\theta(a \mid s_t)\, \log \pi_\theta(a \mid s_t)$$
Maximising entropy keeps the policy spread across actions rather than collapsing to a single deterministic output. Without it, LLM policies tend to produce repetitive, low-diversity text. $c_2$ (typically 0.01) controls the exploration-exploitation trade-off. See EntropyCollapseDemo.
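To make the entropy term concrete, the sketch below compares a uniform next-token distribution with a nearly collapsed one over a hypothetical four-token vocabulary (the helper name `entropy` and the distributions are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]

h_uniform = entropy(uniform)  # maximum possible for 4 outcomes: log(4)
h_peaked = entropy(peaked)    # much smaller: the policy has nearly collapsed
print(h_uniform, h_peaked)
```

Adding this quantity to the objective with a small positive coefficient rewards the policy for staying closer to the first case than the second.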
PPO hyperparameters
| Symbol | Typical value | Role |
|---|---|---|
| $\epsilon$ | 0.1–0.2 | PPO clip ratio |
| $\gamma$ | 0.99–1.0 | Discount factor |
| $\lambda$ | 0.95 | GAE interpolation |
| $c_1$ | 0.5–1.0 | Value loss coefficient |
| $c_2$ | 0.01 | Entropy bonus coefficient |
| $K$ | 3–4 | Gradient steps per rollout |
GRPO objective
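$$J_{GRPO}(\theta) = \mathbb{E}_{q,\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big( r_{i,t}(\theta)\, \hat{A}_i,\ \mathrm{clip}(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \big) - \beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref}) \Big) \right]$$
GRPO removes the learned value critic entirely. The advantage comes from comparing completions within a group, not from a separate value network.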
$\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(\cdot \mid q)$: group sampling
$G$ completions sampled from the old policy $\pi_{\theta_{old}}$ for the same prompt $q$. The group is the unit of normalisation: all outputs share one baseline computed from their collective reward signal.
Why sample a group? A single reward per prompt has no baseline to subtract. Generating $G$ outputs and comparing them against each other recovers a low-variance advantage without training a value network. See GRPOAdvantageExplorer.
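$\hat{A}_i$: group-normalised advantage
$$\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_1, \dots, R_G\})}{\mathrm{std}(\{R_1, \dots, R_G\})}$$
Here $R_i$ is the scalar reward of completion $o_i$. The same advantage value is broadcast to every token position in completion $o_i$; there is no token-level credit assignment within a single completion.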
Why subtract the group mean? Any constant baseline that doesn't depend on the action can be subtracted without biasing the gradient. The group mean eliminates variance common to all outputs for this prompt.
Why divide by group std? Makes the advantage scale consistent across prompts and reward models — necessary for stable updates when reward distributions vary.
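The mean-and-std normalisation described above fits in a few lines. This is a sketch under two assumptions not stated in the post: the population standard deviation is used (some implementations use the sample standard deviation instead), and a small `eps` guards against a zero-variance group:

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalise scalar rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical binary rewards for G = 4 completions of one prompt.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
print(mixed)  # correct completions get positive advantage, incorrect ones negative

# If every completion gets the same reward there is no learning signal.
flat = group_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Each value in the returned list would then be repeated across every token of the corresponding completion.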
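$r_{i,t}(\theta)$: probability ratio (GRPO)
$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q,\, o_{i,<t})}$$
Identical in form to the PPO ratio. When $r_{i,t}(\theta) > 1$ the new policy assigns more probability to this token than the old policy did. The same PPO clipped surrogate is applied.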
$\frac{1}{|o_i|}$: per-token normalisation
Divides by output length $|o_i|$ to prevent the objective from being dominated by long completions. Each completion contributes equally regardless of token count.
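The effect of the length normalisation shows up when a short and a long completion are averaged together. The sketch below (hypothetical function name and toy data, KL term omitted) averages the clipped term over tokens within each completion first, then over the group, and compares that with pooling all tokens:

```python
def grpo_surrogate(group_ratios, advantages, eps=0.2):
    """Mean over tokens within each completion, then mean over the group."""
    per_completion = []
    for ratios, adv in zip(group_ratios, advantages):
        terms = []
        for r in ratios:
            clipped = max(1.0 - eps, min(r, 1.0 + eps))
            terms.append(min(r * adv, clipped * adv))
        per_completion.append(sum(terms) / len(ratios))   # the 1/|o_i| factor
    return sum(per_completion) / len(per_completion)      # the 1/G factor

# Hypothetical group: a 2-token completion with advantage +1 and an
# 8-token completion with advantage -1, all ratios equal to 1.
group_ratios = [[1.0] * 2, [1.0] * 8]
advantages = [1.0, -1.0]

balanced = grpo_surrogate(group_ratios, advantages)

# Pooling every token instead lets the long completion dominate.
pooled = (2 * 1.0 + 8 * -1.0) / 10
print(balanced, pooled)
```

With per-completion averaging the two completions cancel exactly; with token pooling the longer one carries four times the weight.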
$\beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref})$: KL penalty
Penalises the policy for drifting too far from the reference model $\pi_{ref}$ (usually the SFT checkpoint). Without it, the policy can find degenerate outputs that exploit the reward model, which is known as reward hacking. See KLPenaltyTradeoff.
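Direction of the KL, $D_{KL}(\pi_\theta \,\|\, \pi_{ref})$: the expectation is taken under the policy, so the term penalises $\pi_\theta$ for putting mass anywhere $\pi_{ref}$ does not. With $\pi_\theta$ as the distribution being optimised, this is the direction usually called reverse (zero-forcing) KL, not the mode-covering forward KL $D_{KL}(\pi_{ref} \,\|\, \pi_\theta)$. It keeps the policy inside the reference distribution's support rather than collapsing onto a single high-reward mode.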
$\beta$ trade-off: large $\beta$ keeps the policy close to the reference (safe, less reward); small $\beta$ allows more deviation (higher reward potential, more hacking risk). See KLDivergenceExplorer.
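The penalty is typically estimated per token from log-probabilities of the sampled tokens. The sketch below uses the estimator $\pi_{ref}/\pi_\theta - \log(\pi_{ref}/\pi_\theta) - 1$ from the DeepSeekMath formulation of GRPO; whether this post's interactive demos use the same estimator is an assumption, and `beta=0.04` is simply a value inside the typical range:

```python
import math

def kl_penalty(policy_logprobs, ref_logprobs, beta=0.04):
    """beta times the mean per-token estimate of KL(policy || reference).

    Each term is exp(d) - d - 1 with d = log pi_ref - log pi_theta,
    which is non-negative and zero only when the two log-probs match.
    """
    per_token = []
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        log_ratio = lr - lp
        per_token.append(math.exp(log_ratio) - log_ratio - 1.0)
    return beta * sum(per_token) / len(per_token)

# Identical policies: no penalty.
same = kl_penalty([math.log(0.5)], [math.log(0.5)], beta=1.0)

# Policy assigns 0.5 to a token the reference gives 0.25: positive penalty.
drifted = kl_penalty([math.log(0.5)], [math.log(0.25)], beta=1.0)
print(same, drifted)
```

Scaling the result by a larger `beta` pulls the optimum closer to the reference model, which is the trade-off described above.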
GRPO hyperparameters
| Symbol | Typical value | Role |
|---|---|---|
| $G$ | 4–16 | Group size; larger groups give a lower-variance baseline at more compute |
| $\epsilon$ | 0.1–0.2 | PPO clip ratio |
| $\beta$ | 0.01–0.1 | KL penalty strength |
PPO vs GRPO at a glance
| | PPO | GRPO |
|---|---|---|
| Advantage source | Learned value function + GAE | Group mean/std of scalar rewards |
| Auxiliary loss | Value function loss + entropy | KL penalty to $\pi_{ref}$ |
| Extra networks | Critic (value head) | Frozen reference model |
| Credit assignment | Per-token via GAE bootstrapping | Per-completion, broadcast to all tokens |
| Main failure mode | Value model collapse at long horizons | High variance when $G$ is small |