Cheatsheet: PPO

Swastik Roy

Every equation in PPO annotated term-by-term — the clipped surrogate, GAE, value loss, and entropy bonus — with links to the posts and visuals explaining each design choice.

January 10, 2025 · 5 min read

cheatsheet ppo rl llm-training alignment

Proximal Policy Optimisation (PPO) is one of the most widely used on-policy RL algorithms for LLM alignment. Its key insight is replacing a hard trust-region constraint with a clipped probability ratio that is easy to optimise with standard gradient descent. This cheatsheet walks through every equation.

The full PPO objective

\mathcal{L}_{\text{PPO}}(\theta) = \mathcal{L}_{\text{CLIP}}(\theta) - c_1\, \mathcal{L}_{\text{VF}}(\theta) + c_2\, \mathcal{H}[\pi_\theta]

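The objective is maximised. The three terms are combined with coefficients $c_1$ and $c_2$: the clipped surrogate and the entropy bonus are added, while the value loss is subtracted because it is a quantity to minimise. Each term serves a different purpose and can be understood independently.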
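As a minimal sketch of how the signs in the objective above work out, the snippet below combines three scalar terms and negates the result to get a loss for a gradient-descent optimiser. The function names `ppo_objective` and `ppo_loss` are hypothetical, and the default coefficients are simply picked from the typical values quoted later in this cheatsheet.

```python
def ppo_objective(l_clip: float, l_vf: float, entropy: float,
                  c1: float = 0.5, c2: float = 0.01) -> float:
    """Objective to MAXIMISE: L_CLIP - c1 * L_VF + c2 * H."""
    return l_clip - c1 * l_vf + c2 * entropy


def ppo_loss(l_clip: float, l_vf: float, entropy: float,
             c1: float = 0.5, c2: float = 0.01) -> float:
    """Loss to MINIMISE with gradient descent: the negated objective."""
    return -ppo_objective(l_clip, l_vf, entropy, c1, c2)


# Toy scalars, for illustration only (not real training statistics).
example_objective = ppo_objective(l_clip=1.0, l_vf=2.0, entropy=3.0)
example_loss = ppo_loss(l_clip=1.0, l_vf=2.0, entropy=3.0)
```

Most deep-learning optimisers minimise, which is why implementations usually work with the negated form.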

1. The clipped surrogate $\mathcal{L}_{\text{CLIP}}$

\mathcal{L}_{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta), 1{-}\varepsilon, 1{+}\varepsilon)\,\hat{A}_t\right)\right]

$r_t(\theta)$ — probability ratio

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

The ratio of the new policy's probability to the old policy's probability for action $a_t$ at state $s_t$. This is the importance-sampling weight that lets us evaluate the new policy's performance using trajectories collected under the old one.

Why a ratio rather than the raw log-probability? The ratio connects directly to the expected advantage under the new policy via importance sampling. For actions sampled at a fixed state, $\mathbb{E}_{\pi_\theta}[A] = \mathbb{E}_{\pi_{\text{old}}}[r_t \cdot A]$. See PerTokenRatioExplorer for a per-token breakdown.
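The importance-sampling identity can be checked numerically on a toy example. The sketch below assumes a single state with three actions; the probabilities and advantages are made-up illustrative numbers, and `probability_ratio` is a hypothetical helper. Computing the ratio from log-probabilities is a common choice because log-probabilities are what a model usually exposes.

```python
import numpy as np


def probability_ratio(logp_new, logp_old):
    """r = pi_new(a|s) / pi_old(a|s), computed as exp(logp_new - logp_old)."""
    return np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))


# Toy single-state policies over three actions (illustrative values).
pi_old = np.array([0.5, 0.3, 0.2])
pi_new = np.array([0.6, 0.3, 0.1])
adv = np.array([1.0, -0.5, 0.25])

ratio = probability_ratio(np.log(pi_new), np.log(pi_old))

# Expected advantage under the new policy, computed two ways.
lhs = np.sum(pi_new * adv)           # E_{a ~ pi_new}[A]
rhs = np.sum(pi_old * ratio * adv)   # E_{a ~ pi_old}[r * A]
```

Both sides agree up to floating-point error because $\pi_{\text{old}}(a) \cdot r(a) = \pi_\theta(a)$ for every action.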

$\hat{A}_t$ — the advantage estimate

The advantage $\hat{A}_t$ answers: how much better was action $a_t$ than average from state $s_t$?

\hat{A}_t = Q(s_t, a_t) - V(s_t)

In practice $Q(s_t, a_t)$ is not available; it is estimated with Generalised Advantage Estimation (GAE).

Why subtract $V(s_t)$? The value function is a baseline: any function of the state that does not depend on the action can be subtracted without biasing the gradient. Subtracting $V(s_t)$ removes variance common to all actions from state $s_t$, leaving only the action-specific signal. See BaselineVarianceReduction.
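The "no bias" claim can be verified exactly for a small softmax policy. The sketch below assumes a single state with three actions and computes the expected score function times a baseline in closed form (no sampling); `softmax` and `expected_score_times_baseline` are hypothetical helpers and the logits are arbitrary.

```python
import numpy as np


def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()


def expected_score_times_baseline(logits, baseline):
    """Exact E_{a ~ pi}[ grad_logits log pi(a) * baseline ] for a softmax policy."""
    pi = softmax(logits)
    n = len(pi)
    # Row a of `score` is grad_logits log pi(a) = onehot(a) - pi.
    score = np.eye(n) - pi[None, :]
    # Weight each row by pi(a) and sum over actions.
    return baseline * (pi @ score)


logits = np.array([0.2, -1.0, 1.5])
bias = expected_score_times_baseline(logits, baseline=3.7)
```

The result is the zero vector for any baseline value, so subtracting the baseline changes the variance of the gradient estimate but not its expectation.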

$\text{clip}(r_t, 1{-}\varepsilon, 1{+}\varepsilon)$ — the trust region

When $\hat{A}_t > 0$ (good action), the $\min$ removes the gradient incentive once $r_t > 1+\varepsilon$: we do not keep pushing a token we have already made much more likely. Symmetrically, when $\hat{A}_t < 0$ (bad action), the gradient is zeroed once $r_t < 1-\varepsilon$.

Why clip instead of a hard constraint? TRPO enforces $\mathbb{D}_{\text{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \leq \delta$ as a hard constraint, which requires second-order optimisation. PPO approximates the same trust-region effect with a first-order clipping operation that is much cheaper per update and, in practice, nearly as stable. See PPOLossDecomposition.
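A minimal NumPy sketch of the clipped surrogate follows, assuming the ratios and advantages have already been computed per timestep. The function name `clipped_surrogate` and the default `eps=0.2` are illustrative choices, not taken from any particular library.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Mean over timesteps of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The element-wise min makes the objective a pessimistic bound.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Good action already pushed past 1 + eps: the clipped branch is selected.
good_saturated = clipped_surrogate([1.5], [1.0])
# Bad action already pushed below 1 - eps: the clipped branch is selected.
bad_saturated = clipped_surrogate([0.5], [-1.0])
# Good action whose ratio dropped: the unclipped branch is kept, so the
# gradient can still pull the ratio back up.
good_dropped = clipped_surrogate([0.5], [1.0])
```

In the two saturated cases the selected branch is constant in the ratio, which is what removes the gradient signal.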

2. Generalised Advantage Estimation (GAE)

\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}

where the TD residual is:

\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)

$\delta_t$ — one-step TD error

$r_t$ is the immediate reward (not to be confused with the probability ratio $r_t(\theta)$ from the clipped surrogate). $\gamma V(s_{t+1})$ is the discounted value of the next state (what the critic predicts we will earn from there onwards). Subtracting the current value $V(s_t)$ gives the prediction error, which is positive if the step exceeded expectations.

Why use TD errors instead of raw returns? Raw Monte Carlo returns have high variance (the full trajectory contributes). TD errors bootstrap from the value function, trading some bias for greatly reduced variance. See GAEExplorer.
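The TD residual can be computed for a whole trajectory in one vectorised step. This sketch assumes `values` holds one more entry than `rewards`, with the final entry being the bootstrap value of the state after the last step (0 if the episode terminated); the helper name `td_residuals` and the toy numbers are illustrative.

```python
import numpy as np


def td_residuals(rewards, values, gamma=0.99):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for t = 0..T-1.

    `values` has length T + 1: V(s_0), ..., V(s_T).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]


# Toy episode: reward only at the last step, terminal value set to 0.
deltas = td_residuals(rewards=[0.0, 0.0, 1.0],
                      values=[0.5, 0.6, 0.7, 0.0],
                      gamma=1.0)
```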

$\gamma$ — discount factor

\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots

$\gamma \in [0,1]$ down-weights future rewards. At $\gamma = 0$ only the immediate reward matters; at $\gamma = 1$ all future rewards are weighted equally.

Why discount? Discounting has two justifications. Mathematically: for $\gamma < 1$ and bounded rewards, it ensures the infinite sum converges. Behaviourally: it models the agent's preference for sooner rewards, and in practice acts as a variance-reduction technique by down-weighting distant and uncertain returns. For LLMs that receive a single reward at the end of the episode, $\gamma$ is typically 1 or close to it.
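The two extremes of $\gamma$ are easy to see on a short reward sequence. The sketch below uses a hypothetical helper `discounted_return` that accumulates the return backwards; the rewards are toy values.

```python
def discounted_return(rewards, gamma):
    """r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, gamma=0.0)       # only r_0 counts
undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum of rewards
halfway = discounted_return(rewards, gamma=0.5)      # 1 + 0.5 + 0.25
```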

$\lambda$ — the GAE interpolation parameter

$\lambda$ controls the blend between the one-step TD estimate ($\lambda=0$, low variance, high bias) and the full Monte Carlo return ($\lambda=1$, high variance, zero bias). Typical value: $\lambda = 0.95$.

Why not always use $\lambda=1$? The high variance of full Monte Carlo returns can swamp the signal in policy gradient updates, slowing convergence. See GAELambdaSweep for an interactive sweep.
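For a finite episode, the GAE sum is usually computed with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$, rather than by summing each tail explicitly. The sketch below assumes a single episode of length $T$ that ends at the last step, with `values` of length $T+1$ (last entry 0 for a terminal state); the function name `gae` and the toy numbers are illustrative.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation for one finite episode.

    `values` has length T + 1: V(s_0), ..., V(s_T).
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros(len(rewards))
    running = 0.0
    # Backward pass: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7, 0.0]
adv_td = gae(rewards, values, gamma=1.0, lam=0.0)  # one-step TD errors
adv_mc = gae(rewards, values, gamma=1.0, lam=1.0)  # return-to-go minus V(s_t)
```

With $\lambda=0$ the result equals the TD residuals; with $\lambda=1$ it equals the Monte Carlo return from each step minus the critic's value for that state.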

3. Value function loss $\mathcal{L}_{\text{VF}}$

\mathcal{L}_{\text{VF}}(\theta) = \mathbb{E}_t\!\left[\left(V_\theta(s_t) - \hat{V}_t\right)^2\right]

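where $\hat{V}_t = \hat{A}_t + V_{\theta_{\text{old}}}(s_t)$ is the regression target, held fixed during the update. When $\hat{A}_t$ comes from GAE, this target is the $\lambda$-return rather than a one-step TD target.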

The critic $V_\theta$ is trained alongside the policy to predict expected future rewards from each state. A well-trained critic gives accurate baselines, which reduces advantage variance — a virtuous cycle.

Why jointly train the critic? A separate critic requires its own optimiser and forward pass. Sharing parameters (or training jointly in the same loss) is more compute-efficient. The coefficient $c_1$ balances how aggressively the critic is updated relative to the policy.
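A minimal sketch of the value loss, assuming the advantages and the old critic's values were stored at rollout time and are treated as constants. The function name `value_loss` and the toy arrays are illustrative.

```python
import numpy as np


def value_loss(values_new, values_old, advantages):
    """Mean squared error between V_theta(s_t) and the fixed target A_t + V_old(s_t)."""
    values_new = np.asarray(values_new, dtype=float)
    # The target is built from rollout-time quantities and is not differentiated.
    targets = np.asarray(advantages, dtype=float) + np.asarray(values_old, dtype=float)
    return float(np.mean((values_new - targets) ** 2))


# Toy batch of two timesteps: targets are [1.0, 3.0].
loss = value_loss(values_new=[1.0, 2.0],
                  values_old=[0.5, 1.5],
                  advantages=[0.5, 1.5])
```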

4. Entropy bonus $\mathcal{H}[\pi_\theta]$

\mathcal{H}[\pi_\theta] = -\mathbb{E}_t\!\left[\sum_a \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t)\right]

Maximising entropy encourages the policy to remain spread across actions rather than collapsing to a single deterministic output.

Why add entropy? A deterministic policy stops exploring. For LLMs this manifests as repetitive, low-diversity text. The entropy bonus keeps the model from converging prematurely to a degenerate high-reward mode. See EntropyCollapseDemo.
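The entropy term can be computed directly from logits. The sketch below assumes a `(T, num_actions)` array of logits, one row per timestep, and uses a log-softmax for numerical stability; `policy_entropy` is a hypothetical helper.

```python
import numpy as np


def policy_entropy(logits):
    """Mean over timesteps of -sum_a pi(a|s_t) * log pi(a|s_t)."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilise the exponentials
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return float(np.mean(-(probs * log_probs).sum(axis=-1)))


# A uniform policy over 4 actions has the maximum entropy, log(4).
uniform = policy_entropy([[0.0, 0.0, 0.0, 0.0]])
# A sharply peaked policy has entropy close to 0.
peaked = policy_entropy([[100.0, 0.0, 0.0, 0.0]])
```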

Coefficients at a glance

Term	Coefficient	Typical value	Role
$\mathcal{L}_{\text{CLIP}}$	1	—	Policy improvement
$\mathcal{L}_{\text{VF}}$	$c_1$	0.5–1.0	Critic accuracy
$\mathcal{H}$	$c_2$	0.01	Exploration

Key hyperparameters

Symbol	Typical value	Role
$\varepsilon$	0.1–0.2	PPO clip ratio
$\gamma$	0.99–1.0	Discount factor
$\lambda$	0.95	GAE interpolation
$K$	3–4	Optimisation epochs per rollout
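One way to keep these settings together is a small config object. The sketch below is a hypothetical `PPOConfig` dataclass; the field names are made up for illustration, and each default is just one pick from the typical values in the two tables above, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    clip_eps: float = 0.2        # epsilon: PPO clip ratio (table: 0.1 to 0.2)
    gamma: float = 0.99          # discount factor (table: 0.99 to 1.0)
    gae_lambda: float = 0.95     # lambda: GAE interpolation
    k_epochs: int = 4            # K in the hyperparameter table (3 to 4)
    vf_coef: float = 0.5         # c1: weight on the value loss
    entropy_coef: float = 0.01   # c2: weight on the entropy bonus

    def __post_init__(self):
        # Basic range checks so an obvious typo fails early.
        if not 0.0 < self.clip_eps < 1.0:
            raise ValueError("clip_eps must be in (0, 1)")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must be in [0, 1]")
        if not 0.0 <= self.gae_lambda <= 1.0:
            raise ValueError("gae_lambda must be in [0, 1]")
        if self.k_epochs < 1:
            raise ValueError("k_epochs must be at least 1")


cfg = PPOConfig()
```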
