Cheatsheet: PPO
Every equation in PPO annotated term-by-term — the clipped surrogate, GAE, value loss, and entropy bonus — with links to the posts and visuals explaining each design choice.
Proximal Policy Optimisation (PPO) is the dominant on-policy RL algorithm for LLM alignment. Its key insight is replacing a hard trust-region constraint with a clipped probability ratio that is easy to optimise with standard gradient descent. This cheatsheet walks through every equation.
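The full PPO objective
$$L(\theta) = \hat{\mathbb{E}}_t\left[\, L^{\text{CLIP}}_t(\theta) \;-\; c_1\, L^{\text{VF}}_t(\theta) \;+\; c_2\, H[\pi_\theta](s_t) \,\right]$$
Three terms are summed with coefficients $c_1$ and $c_2$. Written this way the objective is maximised, so the value loss enters with a minus sign. Each term serves a different purpose and can be understood independently.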
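1. The clipped surrogate
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
$r_t(\theta)$: probability ratio
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$
The ratio of the new policy's probability to the old policy's probability for action $a_t$ at state $s_t$. This is the importance-sampling weight that lets us evaluate the new policy's performance using trajectories collected under the old one.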
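Why a ratio rather than the raw log-probability? The ratio connects directly to the expected advantage under the new policy via importance sampling: $\mathbb{E}_{a \sim \pi_\theta}\left[A(s, a)\right] = \mathbb{E}_{a \sim \pi_{\theta_{\text{old}}}}\left[r(\theta)\, A(s, a)\right]$. See PerTokenRatioExplorer for a per-token breakdown.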
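In practice the ratio is usually computed from stored log-probabilities rather than raw probabilities, since the difference of logs is numerically safer than a quotient of small numbers. A minimal sketch, assuming per-token log-probabilities are already available; the function name and the sample numbers are hypothetical:

```python
import math

def probability_ratios(new_logprobs, old_logprobs):
    """Per-token ratio pi_new(a_t|s_t) / pi_old(a_t|s_t), computed in log space."""
    return [math.exp(new - old) for new, old in zip(new_logprobs, old_logprobs)]

# Hypothetical probabilities of three sampled tokens under each policy.
old_lp = [math.log(0.20), math.log(0.50), math.log(0.10)]
new_lp = [math.log(0.30), math.log(0.50), math.log(0.05)]

# The first token became more likely, the second is unchanged, the third less likely.
ratios = probability_ratios(new_lp, old_lp)
```

A ratio above 1 means the new policy favours the sampled token more than the old policy did; a ratio below 1 means it favours it less.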
$\hat{A}_t$: the advantage estimate
$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$
The advantage answers: how much better was action $a_t$ than average from state $s_t$?
In practice the true advantage $A^{\pi}(s_t, a_t)$ is not available; it is estimated with Generalised Advantage Estimation (GAE).
Why subtract $V^{\pi}(s_t)$? The value function is a baseline: any quantity that does not depend on the action can be subtracted without biasing the gradient. Subtracting $V^{\pi}(s_t)$ removes variance common to all actions from state $s_t$, leaving only the action-specific signal. See BaselineVarianceReduction.
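The claim that a baseline does not bias the gradient can be checked in a few lines. The derivation below is a standard one, written for a discrete action space and an arbitrary state-dependent baseline $b(s)$ (a symbol introduced here for illustration):

$$
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
&= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta 1 = 0
\end{aligned}
$$

The baseline term contributes nothing in expectation, so it changes only the variance of the gradient estimate, not its mean.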
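$\text{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)$: the trust region
When $\hat{A}_t > 0$ (good action), the $\text{clip}$ removes the gradient incentive once $r_t(\theta) > 1+\epsilon$: we do not keep pushing a token we have already made much more likely. Symmetrically, when $\hat{A}_t < 0$ (bad action), the gradient is zeroed once $r_t(\theta) < 1-\epsilon$.
Why clip instead of a hard constraint? TRPO enforces $\hat{\mathbb{E}}_t\left[D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$ as a hard constraint, which requires second-order optimisation. PPO approximates the same trust-region effect with a first-order clipping operation, which is far cheaper per update and almost as stable. See PPOLossDecomposition.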
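The clipping behaviour is easiest to see on concrete numbers. The following is a minimal sketch of the clipped surrogate, taking the minimum of the unclipped and clipped ratio-times-advantage terms and averaging over samples; the function names, the default `eps=0.2`, and the sample values are assumptions for illustration:

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(r, 1.0 + eps))
        total += min(r * adv, clipped_r * adv)
    return total / len(ratios)

def is_clipped(r, adv, eps=0.2):
    """True when the clipped branch is selected, so the gradient w.r.t. r is zero."""
    return (adv > 0 and r > 1.0 + eps) or (adv < 0 and r < 1.0 - eps)

# Hypothetical ratios and advantages for three sampled tokens.
ratios = [1.5, 1.0, 0.5]
advantages = [1.0, 2.0, -1.0]
surrogate = clipped_surrogate(ratios, advantages)
flags = [is_clipped(r, a) for r, a in zip(ratios, advantages)]
```

Note the asymmetry: a bad action whose ratio has risen above $1+\epsilon$ is not clipped, because the minimum then selects the unclipped term and the policy can still be pushed back.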
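2. Generalised Advantage Estimation (GAE)
$$\hat{A}_t^{\text{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$
where the TD residual is:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
$\delta_t$: one-step TD error
$r_t$ is the immediate reward (the environment reward, distinct from the probability ratio). $\gamma V(s_{t+1})$ is the discounted value of the next state (what the critic predicts we will earn from here). Subtracting the current value $V(s_t)$ gives the prediction error, which is positive if the step exceeded expectations.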
Why use TD errors instead of raw returns? Raw Monte Carlo returns have high variance (the full trajectory contributes). TD errors bootstrap from the value function, trading some bias for greatly reduced variance. See GAEExplorer.
$\gamma$: discount factor
$\gamma \in [0, 1]$ down-weights future rewards. At $\gamma = 0$ only the immediate reward matters; at $\gamma = 1$ all future rewards are weighted equally.
Why discount? Discounting has two justifications. Mathematically: it ensures the infinite sum converges. Behaviourally: it models the agent's preference for sooner rewards, and in practice acts as a variance-reduction technique by down-weighting distant and uncertain returns. For LLMs with single-step episode rewards, $\gamma$ is typically 1 or close to it.
$\lambda$: the GAE interpolation parameter
$\lambda \in [0, 1]$ controls the blend between the one-step TD estimate ($\lambda = 0$, low variance, high bias) and the full Monte Carlo return ($\lambda = 1$, high variance, zero bias). Typical value: $\lambda = 0.95$.
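The two endpoints can be written out explicitly. Assuming the standard GAE sum of TD residuals weighted by powers of $\gamma\lambda$, and an episode that terminates (so the trailing value term vanishes), the limits are:

$$
\begin{aligned}
\lambda = 0: \quad & \hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \\
\lambda = 1: \quad & \hat{A}_t = \sum_{l=0}^{\infty} \gamma^l\, \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l\, r_{t+l} - V(s_t)
\end{aligned}
$$

The second line follows because the intermediate value terms telescope: each $\gamma^{l+1} V(s_{t+l+1})$ added by one residual is subtracted by the next, leaving only the discounted return minus the current value.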
Why not always use $\lambda = 1$? The high variance of full Monte Carlo returns can swamp the signal in policy gradient updates, slowing convergence. See GAELambdaSweep for an interactive sweep.
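GAE is commonly computed with a single backward pass over the trajectory, using the recursion $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$, which is equivalent to the weighted sum of TD residuals. A minimal sketch, assuming a finite episode and a `values` list with one extra bootstrap entry at the end (0.0 for a terminal state); the names and numbers are hypothetical:

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Advantages via the backward recursion A_t = delta_t + gamma * lam * A_{t+1}.

    `values` must have len(rewards) + 1 entries; the last one is the bootstrap
    value of the state after the final step (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical three-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]

adv_td = compute_gae(rewards, values, gamma=1.0, lam=0.0)  # one-step TD errors
adv_mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)  # return minus value
adv_mix = compute_gae(rewards, values, gamma=1.0, lam=0.95)
```

With `lam=0.0` each advantage is just the local TD error; with `lam=1.0` each advantage equals the full return minus the critic's prediction for that state.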
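3. Value function loss
$$L^{\text{VF}}(\theta) = \hat{\mathbb{E}}_t\left[\left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2\right]$$
where $V_t^{\text{targ}}$ is the TD target.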
The critic is trained alongside the policy to predict expected future rewards from each state. A well-trained critic gives accurate baselines, which reduces advantage variance — a virtuous cycle.
Why jointly train the critic? A separate critic requires its own optimiser and forward pass. Sharing parameters (or training jointly in the same loss) is more compute-efficient. The coefficient $c_1$ balances how aggressively the critic is updated relative to the policy.
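The value loss itself is a plain mean squared error between the critic's predictions and their targets. A minimal sketch, assuming the targets have already been computed elsewhere; the function name and the numbers are hypothetical, and how the targets are built is left open here:

```python
def value_loss(predicted_values, value_targets):
    """Mean squared error between critic predictions and their targets."""
    n = len(predicted_values)
    return sum((v - tgt) ** 2 for v, tgt in zip(predicted_values, value_targets)) / n

# Hypothetical critic predictions for three states, each with a target of 1.0.
predictions = [0.5, 0.6, 0.8]
targets = [1.0, 1.0, 1.0]
loss = value_loss(predictions, targets)
```

Some implementations add a factor of one half or clip the value update as well; neither is shown here.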
4. Entropy bonus
$$H[\pi_\theta](s_t) = -\sum_{a} \pi_\theta(a \mid s_t)\, \log \pi_\theta(a \mid s_t)$$
Maximising entropy encourages the policy to remain spread across actions rather than collapsing to a single deterministic output.
Why add entropy? A deterministic policy stops exploring. For LLMs this manifests as repetitive, low-diversity text. The entropy bonus keeps the model from converging prematurely to a degenerate high-reward mode. See EntropyCollapseDemo.
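To make the effect concrete, the sketch below computes the Shannon entropy of a few hypothetical next-token distributions over a four-token vocabulary. The function name and the distributions are assumptions for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = entropy([0.25, 0.25, 0.25, 0.25])       # maximal for four actions: log(4)
peaked = entropy([0.97, 0.01, 0.01, 0.01])        # nearly collapsed
deterministic = entropy([1.0, 0.0, 0.0, 0.0])     # fully collapsed
```

The entropy falls towards zero as the distribution concentrates on one token, so a positive entropy coefficient rewards the policy for staying away from that collapsed regime.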
Coefficients at a glance
| Term | Coefficient | Typical value | Role |
|---|---|---|---|
| $L^{\text{CLIP}}$ | 1 | n/a | Policy improvement |
| $L^{\text{VF}}$ | $c_1$ | 0.5–1.0 | Critic accuracy |
| $H[\pi_\theta]$ | $c_2$ | 0.01 | Exploration |
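Putting the three terms together is a one-line weighted sum. The sketch below assumes the sign convention in which the combined objective is maximised (value loss subtracted, entropy added) and negates it to obtain a loss for a minimising optimiser; the function names and input numbers are hypothetical:

```python
def ppo_objective(clip_term, value_loss, entropy, c1=0.5, c2=0.01):
    """Objective to maximise: clipped surrogate - c1 * value loss + c2 * entropy."""
    return clip_term - c1 * value_loss + c2 * entropy

def ppo_loss(clip_term, value_loss, entropy, c1=0.5, c2=0.01):
    """Same quantity negated, for optimisers that minimise."""
    return -ppo_objective(clip_term, value_loss, entropy, c1, c2)

# Hypothetical per-batch values of the three terms.
objective = ppo_objective(clip_term=0.8, value_loss=0.15, entropy=1.2)
loss = ppo_loss(clip_term=0.8, value_loss=0.15, entropy=1.2)
```

With these defaults the entropy term is small relative to the other two, which matches its role as a gentle regulariser rather than a primary training signal.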
Key hyperparameters
| Symbol | Typical value | Role |
|---|---|---|
| $\epsilon$ | 0.1–0.2 | PPO clip ratio |
| $\gamma$ | 0.99–1.0 | Discount factor |
| $\lambda$ | 0.95 | GAE interpolation |
| $K$ | 3–4 | Gradient steps per rollout |
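One way to keep these settings together is a small immutable config object with range checks. This is a sketch only: the class name, field names, and chosen defaults are hypothetical, with each default picked from inside the typical ranges listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    clip_eps: float = 0.2          # PPO clip ratio
    gamma: float = 1.0             # discount factor
    gae_lambda: float = 0.95       # GAE interpolation
    epochs_per_rollout: int = 4    # gradient steps per rollout
    value_coef: float = 0.5        # weight on the value loss
    entropy_coef: float = 0.01     # weight on the entropy bonus

    def validate(self):
        """Raise ValueError if any setting falls outside its meaningful range."""
        if not 0.0 < self.clip_eps < 1.0:
            raise ValueError("clip_eps must be in (0, 1)")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must be in [0, 1]")
        if not 0.0 <= self.gae_lambda <= 1.0:
            raise ValueError("gae_lambda must be in [0, 1]")
        if self.epochs_per_rollout < 1:
            raise ValueError("epochs_per_rollout must be at least 1")
        return self

config = PPOConfig().validate()
```

Freezing the dataclass prevents a setting from being changed halfway through a run, which makes experiments easier to reproduce.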