S. Roy

Blog Post

Cheatsheet: PPO

Every equation in PPO annotated term-by-term — the clipped surrogate, GAE, value loss, and entropy bonus — with links to the posts and visuals explaining each design choice.


Proximal Policy Optimisation (PPO) is one of the most widely used on-policy RL algorithms for LLM alignment. Its key insight is to replace a hard trust-region constraint with a clipped probability ratio that is easy to optimise with standard gradient descent. This cheatsheet walks through every equation.


The full PPO objective

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathcal{L}_{\text{CLIP}}(\theta) - c_1\, \mathcal{L}_{\text{VF}}(\theta) + c_2\, \mathcal{H}[\pi_\theta]$$

The three terms are combined with coefficients $c_1$ and $c_2$. The objective is maximised, so the value loss enters with a minus sign. Each term serves a different purpose and can be understood independently.
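As a minimal sketch of the sign conventions, the snippet below combines the three terms into a scalar loss for a gradient-descent optimiser, which minimises, by negating the objective. The function name `ppo_loss` and the default coefficients are illustrative assumptions, not part of any particular library.

```python
def ppo_loss(l_clip: float, l_vf: float, entropy: float,
             c1: float = 0.5, c2: float = 0.01) -> float:
    """Return the scalar to minimise: the negated PPO objective.

    l_clip  : clipped surrogate (larger is better)
    l_vf    : value-function loss (smaller is better)
    entropy : policy entropy (larger means more exploration)
    """
    objective = l_clip - c1 * l_vf + c2 * entropy
    return -objective


# Hypothetical per-batch values, for illustration only.
loss = ppo_loss(l_clip=0.2, l_vf=0.4, entropy=1.0)
```

Minimising this loss raises the surrogate and the entropy while lowering the value error, which matches the signs in the objective.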


1. The clipped surrogate $\mathcal{L}_{\text{CLIP}}$

$$\mathcal{L}_{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta), 1{-}\varepsilon, 1{+}\varepsilon)\,\hat{A}_t\right)\right]$$

$r_t(\theta)$: probability ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The ratio of the new policy's probability to the old policy's probability for action $a_t$ in state $s_t$. This is the importance-sampling weight that lets us evaluate the new policy's performance using trajectories collected under the old one.

Why a ratio rather than the raw log-probability? The ratio connects directly to the expected advantage under the new policy via importance sampling: $\mathbb{E}_{\pi_\theta}[A] = \mathbb{E}_{\pi_{\text{old}}}[r_t \cdot A]$. See PerTokenRatioExplorer for a per-token breakdown.
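In practice the ratio is usually computed from log-probabilities, since that is what a language model's forward pass yields. The sketch below does this and checks the importance-sampling identity on a toy state with two actions. The function names and the toy numbers are assumptions made for illustration.

```python
import math


def probability_ratio(logp_new: float, logp_old: float) -> float:
    """r_t = pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


def expected_advantage_via_old_policy(pi_old, pi_new, advantages):
    """E_{a ~ pi_old}[ r(a) * A(a) ] for one state with discrete actions."""
    total = 0.0
    for p_old, p_new, adv in zip(pi_old, pi_new, advantages):
        ratio = probability_ratio(math.log(p_new), math.log(p_old))
        total += p_old * ratio * adv
    return total


# Toy example (hypothetical numbers): two actions in a single state.
pi_old = [0.5, 0.5]
pi_new = [0.8, 0.2]
advantages = [1.0, -1.0]

reweighted = expected_advantage_via_old_policy(pi_old, pi_new, advantages)
direct = sum(p * a for p, a in zip(pi_new, advantages))  # E_{a ~ pi_new}[A]
```

Both `reweighted` and `direct` come out to 0.6 up to floating-point error: samples from the old policy, reweighted by the ratio, recover the expectation under the new policy.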


$\hat{A}_t$: the advantage estimate

The advantage $\hat{A}_t$ answers: how much better was action $a_t$ than average from state $s_t$?

$$\hat{A}_t = Q(s_t, a_t) - V(s_t)$$

In practice $Q(s_t, a_t)$ is not available, so the advantage is estimated with Generalised Advantage Estimation (GAE).

Why subtract $V(s_t)$? The value function is a baseline: any function of the state that does not depend on the action can be subtracted without biasing the gradient. Subtracting $V(s_t)$ removes variance common to all actions from state $s_t$, leaving only the action-specific signal. See BaselineVarianceReduction.
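To see why a baseline leaves the gradient unbiased, here is the standard short argument, written for a discrete action space and a generic baseline $b(s_t)$ (a symbol introduced here for illustration; $V(s_t)$ is one choice). It uses the identity $\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta$:

$$
\mathbb{E}_{a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s_t)\, b(s_t)\right]
= b(s_t) \sum_a \nabla_\theta \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta \sum_a \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta 1 = 0
$$

Because the baseline term contributes zero in expectation, subtracting it changes only the variance of the gradient estimate, not its mean.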


$\text{clip}(r_t, 1{-}\varepsilon, 1{+}\varepsilon)$: the trust region

When $\hat{A}_t > 0$ (good action), the $\min$ removes the gradient incentive once $r_t > 1+\varepsilon$: we do not keep pushing a token we have already made much more likely. Symmetrically, when $\hat{A}_t < 0$ (bad action), the gradient is zeroed once $r_t < 1-\varepsilon$.

Why clip instead of a hard constraint? TRPO enforces $\mathbb{D}_{\text{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \leq \delta$ as a hard constraint, which requires second-order optimisation. PPO approximates the same trust-region effect with a first-order clipping operation, which is much cheaper per update and usually almost as stable. See PPOLossDecomposition.
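The clipping behaviour is easiest to see per sample. The sketch below evaluates the term inside the expectation for one (ratio, advantage) pair, with `eps` standing in for $\varepsilon$. The function name and the sample numbers are assumptions for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): the value stops growing once the ratio passes 1 + eps.
good_near = clipped_surrogate(1.3, 1.0)
good_far = clipped_surrogate(1.5, 1.0)

# Bad action (A < 0): the value stops changing once the ratio drops below 1 - eps.
bad_near = clipped_surrogate(0.7, -1.0)
bad_far = clipped_surrogate(0.5, -1.0)

# Bad action whose probability went UP: the min keeps the unclipped penalty.
bad_wrong_way = clipped_surrogate(1.5, -1.0)
```

Here `good_near` equals `good_far` and `bad_near` equals `bad_far`, so the objective is flat in those regions and yields no gradient. `bad_wrong_way` is not clipped, so the policy is still pushed to undo a move in the wrong direction.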


2. Generalised Advantage Estimation (GAE)

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where the TD residual is:

$$\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)$$

$\delta_t$: one-step TD error

$r_t$ is the immediate reward (here $r_t$ denotes the reward, not the probability ratio $r_t(\theta)$ from section 1). $\gamma V(s_{t+1})$ is the discounted value of the next state (what the critic predicts we will earn from there). Subtracting the current value $V(s_t)$ gives the prediction error, which is positive if the step exceeded expectations.

Why use TD errors instead of raw returns? Raw Monte Carlo returns have high variance (the full trajectory contributes). TD errors bootstrap from the value function, trading some bias for greatly reduced variance. See GAEExplorer.
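A minimal sketch of the TD residual over one trajectory, in plain Python. It assumes `values` holds one more entry than `rewards`, the last being the bootstrap value of the state after the final step (0.0 for a terminal state). This layout, the function name, and the numbers are assumptions for illustration.

```python
def td_residuals(rewards, values, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for every step t.

    `values` must have len(rewards) + 1 entries.
    """
    return [
        r + gamma * v_next - v
        for r, v, v_next in zip(rewards, values[:-1], values[1:])
    ]


# Hypothetical three-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]  # last entry: terminal state, value 0
deltas = td_residuals(rewards, values, gamma=1.0)
```

Each residual is positive here because the critic's prediction rises at every step and the final reward exceeds the last prediction.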

$\gamma$: discount factor

$$\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$$

$\gamma \in [0,1]$ down-weights future rewards. At $\gamma = 0$ only the immediate reward matters; at $\gamma = 1$ all future rewards count equally.

Why discount? Discounting has two justifications. Mathematically: it ensures the infinite sum converges. Behaviourally: it models the agent's preference for sooner rewards, and in practice acts as a variance-reduction technique by down-weighting distant and uncertain returns. For LLMs, where each episode usually yields a single reward, $\gamma$ is typically 1 or close to it.
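The discounted return can be sketched as a single backward pass over the rewards, using the recursion $G_t = r_t + \gamma G_{t+1}$. The function name and the reward sequence are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, gamma=0.0)       # only the first reward counts
halved = discounted_return(rewards, gamma=0.5)       # 1 + 0.5 + 0.25
undiscounted = discounted_return(rewards, gamma=1.0) # plain sum
```

The three calls illustrate the two extremes and one intermediate value of the discount factor.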

$\lambda$: the GAE interpolation parameter

$\lambda$ controls the blend between the one-step TD estimate ($\lambda=0$, low variance, high bias) and the full Monte Carlo return ($\lambda=1$, high variance, zero bias). Typical value: $\lambda = 0.95$.

Why not always use $\lambda=1$? The high variance of full Monte Carlo returns can swamp the signal in policy gradient updates, slowing convergence. See GAELambdaSweep for an interactive sweep.
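For a finite trajectory, the GAE sum is commonly computed backwards with the recursion $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, dropping terms beyond the last step. The sketch below follows that recursion in plain Python. The function name, the convention that `values` has one extra bootstrap entry, and the numbers are assumptions for illustration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one trajectory; `values` must have len(rewards) + 1 entries."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0  # terms beyond the end of the trajectory are dropped
    for t in reversed(range(len(rewards))):
        # One-step TD residual at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
    return advantages


# Hypothetical three-step episode with a single reward at the end.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]  # last entry: terminal state, value 0

one_step = gae_advantages(rewards, values, gamma=1.0, lam=0.0)
monte_carlo = gae_advantages(rewards, values, gamma=1.0, lam=1.0)
```

With `lam=0.0` each advantage is just the TD residual at that step. With `lam=1.0` it equals the return from that step minus the critic's prediction, which is the Monte Carlo estimate.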


3. Value function loss $\mathcal{L}_{\text{VF}}$

$$\mathcal{L}_{\text{VF}}(\theta) = \mathbb{E}_t\!\left[\left(V_\theta(s_t) - \hat{V}_t\right)^2\right]$$

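where $\hat{V}_t = \hat{A}_t + V_{\theta_{\text{old}}}(s_t)$ is the regression target: the advantage estimate added back onto the value predicted at rollout time, which gives an estimate of the return from $s_t$.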

The critic $V_\theta$ is trained alongside the policy to predict expected future rewards from each state. A well-trained critic gives accurate baselines, which reduces advantage variance, a virtuous cycle.

Why jointly train the critic? A separate critic requires its own optimiser and forward pass. Sharing parameters (or training jointly in the same loss) is more compute-efficient. The coefficient $c_1$ balances how aggressively the critic is updated relative to the policy.
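A minimal sketch of the value loss as a mean squared error over a batch, with targets built from the advantage estimates and the rollout-time value predictions. The function name, argument layout, and numbers are assumptions for illustration.

```python
def value_loss(new_values, old_values, advantages):
    """Mean of (V_theta(s_t) - V_hat_t)^2 with V_hat_t = A_hat_t + V_old(s_t)."""
    targets = [adv + v_old for adv, v_old in zip(advantages, old_values)]
    squared_errors = [(v - tgt) ** 2 for v, tgt in zip(new_values, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical batch of two states.
old_values = [0.4, 0.9]    # critic predictions at rollout time
advantages = [0.1, -0.1]   # e.g. from GAE
new_values = [0.5, 1.0]    # current critic predictions

loss = value_loss(new_values, old_values, advantages)
```

The targets are 0.5 and 0.8, so the first prediction is exact and the second is off by 0.2. The targets are treated as constants here; in an autodiff framework they would typically be detached from the graph.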


4. Entropy bonus $\mathcal{H}[\pi_\theta]$

$$\mathcal{H}[\pi_\theta] = -\mathbb{E}_t\!\left[\sum_a \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t)\right]$$

Maximising entropy encourages the policy to remain spread across actions rather than collapsing to a single deterministic output.

Why add entropy? A deterministic policy stops exploring. For LLMs this manifests as repetitive, low-diversity text. The entropy bonus keeps the model from converging prematurely to a degenerate high-reward mode. See EntropyCollapseDemo.
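To make the entropy term concrete, the sketch below computes the inner sum for a single state's action distribution, using the convention that terms with zero probability contribute zero. The function name and the example distributions are assumptions for illustration.

```python
import math


def categorical_entropy(probs):
    """-sum_a p(a) * log p(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = categorical_entropy([0.25, 0.25, 0.25, 0.25])  # maximally spread
peaked = categorical_entropy([0.97, 0.01, 0.01, 0.01])   # nearly deterministic
collapsed = categorical_entropy([1.0, 0.0, 0.0, 0.0])    # fully deterministic
```

The uniform distribution attains the maximum, $\log 4$, while the collapsed one has zero entropy. The bonus therefore rewards the policy for staying closer to the first case than the last.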


Coefficients at a glance

| Term | Coefficient | Typical value | Role |
|---|---|---|---|
| $\mathcal{L}_{\text{CLIP}}$ | 1 | | Policy improvement |
| $\mathcal{L}_{\text{VF}}$ | $c_1$ | 0.5–1.0 | Critic accuracy |
| $\mathcal{H}$ | $c_2$ | 0.01 | Exploration |

Key hyperparameters

| Symbol | Typical value | Role |
|---|---|---|
| $\varepsilon$ | 0.1–0.2 | PPO clip ratio |
| $\gamma$ | 0.99–1.0 | Discount factor |
| $\lambda$ | 0.95 | GAE interpolation |
| $K$ | 3–4 | Gradient steps per rollout |

Cite this work


Roy, Swastik. (2025). Cheatsheet: PPO. S. Roy. https://swastikroy.me/blog/cheatsheet-ppo
