S. Roy

Blog Post

Cheatsheet: RL Loss Functions

PPO and GRPO loss functions annotated term-by-term — the clipped surrogate, GAE, value loss, entropy bonus, and group-normalised advantages — with links to the posts explaining each design choice.


The two dominant RL algorithms for LLM alignment share a common core — the PPO clipped surrogate — but differ in how they estimate advantages and what auxiliary losses they carry. This cheatsheet annotates both objectives term-by-term and highlights where they diverge.


PPO objective

$$\mathcal{L}_{\text{PPO}}(\theta) = \mathcal{L}_{\text{CLIP}}(\theta) - c_1\, \mathcal{L}_{\text{VF}}(\theta) + c_2\, \mathcal{H}[\pi_\theta]$$

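Three terms combined with coefficients $c_1$ and $c_2$. The objective is maximised, so the value loss enters with a minus sign and the entropy bonus with a plus sign.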
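As a minimal sketch of how the three terms combine, assuming each has already been reduced to a scalar batch average: most optimisers minimise, so the objective is negated before the backward pass. The name `ppo_loss` and the example numbers are illustrative and not taken from any particular library.

```python
def ppo_loss(l_clip: float, l_vf: float, entropy: float,
             c1: float = 0.5, c2: float = 0.01) -> float:
    """Scalar loss to minimise: the negated PPO objective."""
    objective = l_clip - c1 * l_vf + c2 * entropy
    return -objective


# Illustrative batch averages (made-up numbers).
example_loss = ppo_loss(l_clip=0.10, l_vf=0.40, entropy=2.0)
```

A larger value loss or a lower entropy both increase the quantity being minimised, which matches the signs in the objective.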


$r_t(\theta)$: probability ratio

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The ratio of the new policy's probability to the old policy's for action $a_t$. When $r > 1$ the new policy assigns more probability to this token; when $r < 1$, less. This is the importance-sampling weight that lets us evaluate the new policy using trajectories collected under the old one:

$$\mathbb{E}_{\pi_\theta}[A] = \mathbb{E}_{\pi_{\theta_{\text{old}}}}[r_t \cdot A]$$

See PerTokenRatioExplorer for a per-token breakdown.
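In practice the ratio is usually computed from log-probabilities rather than raw probabilities, since a difference of logs is numerically safer than a quotient of small numbers. A minimal sketch, assuming per-token log-probs for the sampled tokens are already available as plain lists (the helper name `prob_ratios` and the example probabilities are hypothetical):

```python
import math


def prob_ratios(logp_new, logp_old):
    """Per-token ratio pi_new(a_t | s_t) / pi_old(a_t | s_t) from log-probs."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


# Made-up probabilities of the three sampled tokens under each policy.
logp_old = [math.log(0.20), math.log(0.50), math.log(0.10)]
logp_new = [math.log(0.30), math.log(0.50), math.log(0.05)]
ratios = prob_ratios(logp_new, logp_old)
```

The first token became more likely under the new policy (ratio above 1), the second is unchanged, and the third became less likely (ratio below 1).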


$\mathcal{L}_{\text{CLIP}}$: clipped surrogate

$$\mathcal{L}_{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}(r_t(\theta), 1{-}\varepsilon, 1{+}\varepsilon)\,\hat{A}_t\right)\right]$$

The $\min$ of the unclipped and clipped terms acts as a pessimistic bound:

  • When $\hat{A} > 0$ (good action): if $r > 1+\varepsilon$ the gradient is zeroed, so there is no further push once the action is already much more likely.
  • When $\hat{A} < 0$ (bad action): if $r < 1-\varepsilon$ the gradient is zeroed, so there is no further suppression once the action is already much less likely.

Why clip instead of a hard KL constraint? TRPO enforces $\mathbb{D}_{\text{KL}}(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta) \leq \delta$ as a hard constraint, requiring second-order optimisation. PPO approximates the same trust region with a first-order clipping operation. See PPOLossDecomposition.
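The two clipping cases above can be checked with a per-token sketch in plain Python. The function names and the default $\varepsilon = 0.2$ are assumptions made for illustration; a real implementation would operate on tensors and average over the batch.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def gradient_is_zeroed(ratio: float, advantage: float, eps: float = 0.2) -> bool:
    """True when the min selects the clipped branch, which is constant in the ratio."""
    return (advantage > 0 and ratio > 1.0 + eps) or (advantage < 0 and ratio < 1.0 - eps)


# (ratio, advantage) pairs covering the four corner cases.
cases = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
values = [clipped_surrogate(r, a) for r, a in cases]
zeroed = [gradient_is_zeroed(r, a) for r, a in cases]
```

Note the asymmetry: with a negative advantage and a ratio above $1+\varepsilon$, the unclipped branch is the smaller one, so the gradient still flows and pushes the probability back down.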


$\hat{A}_t$: advantage estimate (PPO)

$$\hat{A}_t = Q(s_t, a_t) - V(s_t)$$

How much better was action $a_t$ than average from state $s_t$? In practice $Q$ is not available and is estimated via GAE.

Why subtract $V(s_t)$? Any baseline that depends only on the state, not on the action, can be subtracted without biasing the gradient. Subtracting $V(s_t)$ removes the variance shared by all actions from state $s_t$. See BaselineVarianceReduction.
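The unbiasedness claim follows from a short calculation. As a sketch, assuming a discrete action space and a baseline $b(s)$ that depends only on the state (the symbol $b$ is introduced here for illustration):

$$
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
&= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
$$

The baseline term contributes nothing to the expected gradient, so choosing $b(s) = V(s)$ changes only the variance of the estimate.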


GAE — Generalised Advantage Estimation

$$\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t)$$

$\delta_t$ is the one-step TD error: immediate reward plus discounted next-state value minus the current value prediction. Here $r_t$ denotes the reward at step $t$, not the probability ratio. A positive $\delta_t$ means the step exceeded expectations.

$\gamma$ (discount factor): down-weights future rewards. At $\gamma=0$ only the immediate reward matters; at $\gamma=1$ all future rewards are weighted equally. For LLMs, where the reward typically arrives once at the end of the episode, $\gamma$ is usually 1 or close to it.

$\lambda$ (GAE interpolation): blends the one-step TD estimate ($\lambda=0$, low variance, high bias) with the full Monte Carlo return ($\lambda=1$, high variance, zero bias). Typical value: $\lambda=0.95$. See GAELambdaSweep.
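For a finite episode the infinite sum is usually evaluated with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. The sketch below assumes a single episode stored as plain lists, with `values` holding one extra entry for the state after the last step (0.0 if that state is terminal). The function name `gae` and the toy numbers are illustrative.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalised Advantage Estimation for one finite episode.

    rewards: list of T per-step rewards.
    values:  list of T + 1 value predictions; values[T] is the bootstrap
             value of the final state (0.0 when the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma * lam * A_{t+1}
        advantages[t] = running
    return advantages


# Toy episode: reward only at the final step, as in a typical LLM setup.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]
adv_td = gae(rewards, values, gamma=1.0, lam=0.0)  # pure one-step TD errors
adv_mc = gae(rewards, values, gamma=1.0, lam=1.0)  # Monte Carlo return minus V(s_t)
```

With $\lambda=0$ each advantage is just its own TD error; with $\lambda=1$ and $\gamma=1$ each advantage equals the episode return minus the value prediction at that step.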


$\mathcal{L}_{\text{VF}}$: value function loss

$$\mathcal{L}_{\text{VF}}(\theta) = \mathbb{E}_t\!\left[\left(V_\theta(s_t) - \hat{V}_t\right)^2\right]$$

Trains the critic $V_\theta$ to predict expected future rewards from each state, by regressing onto a return target $\hat{V}_t$. A well-trained critic gives accurate baselines, reducing advantage variance. The coefficient $c_1$ (typically 0.5–1.0) balances how aggressively the critic is updated relative to the policy.
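A minimal sketch of the regression, assuming the target is built as $\hat{V}_t = \hat{A}_t + V_{\text{old}}(s_t)$, which is one common choice and not something the formula above fixes. The names and numbers are illustrative.

```python
def value_loss(values, targets):
    """Mean squared error between critic predictions and return targets."""
    return sum((v - y) ** 2 for v, y in zip(values, targets)) / len(values)


# Assumed target construction: advantage estimate plus the old value prediction.
old_values = [0.5, 0.6, 0.8]
advantages = [0.5, 0.4, 0.2]
targets = [a + v for a, v in zip(advantages, old_values)]
loss = value_loss(old_values, targets)
```

Here every target equals the episode return of 1.0, so the loss measures how far each prediction sits from that return.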


$\mathcal{H}[\pi_\theta]$: entropy bonus

$$\mathcal{H}[\pi_\theta] = -\mathbb{E}_t\!\left[\sum_a \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t)\right]$$

Maximising entropy keeps the policy spread across actions rather than collapsing to a single deterministic output. Without it, LLM policies tend to drift toward repetitive, low-diversity text. $c_2$ (typically 0.01) controls the exploration-exploitation trade-off. See EntropyCollapseDemo.
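To make the scale of the bonus concrete, the sketch below computes the entropy of a single next-token distribution over a toy four-token vocabulary. The helper name `entropy` and the distributions are made up for illustration; the full term averages this quantity over time steps.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally spread over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic
h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
```

The uniform distribution attains the maximum, $\log 4$, while the peaked one is close to zero, which is the regime the bonus discourages.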


PPO hyperparameters

| Symbol | Typical value | Role |
|---|---|---|
| $\varepsilon$ | 0.1–0.2 | PPO clip ratio |
| $\gamma$ | 0.99–1.0 | Discount factor |
| $\lambda$ | 0.95 | GAE interpolation |
| $c_1$ | 0.5–1.0 | Value loss coefficient |
| $c_2$ | 0.01 | Entropy bonus coefficient |
| $K$ | 3–4 | Gradient steps per rollout |

GRPO objective

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left(r_{i,t}(\theta)\, \hat{A}_{i,t},\ \text{clip}(r_{i,t}(\theta), 1{-}\varepsilon, 1{+}\varepsilon)\, \hat{A}_{i,t}\right) - \beta\, \mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \right]$$

GRPO removes the learned value critic entirely. The advantage comes from comparing completions within a group, not from a separate value network.


$\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)$: group sampling

$G$ completions sampled from the old policy for the same prompt $q$. The group is the unit of normalisation: all $G$ outputs share one baseline computed from their collective reward signal.

Why sample a group? A single reward per prompt has no baseline to subtract. Generating $G$ outputs and comparing them against each other recovers a low-variance advantage without training a value network. See GRPOAdvantageExplorer.


$\hat{A}_{i,t}$: group-normalised advantage

$$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^{G})}{\text{std}(\{r_j\}_{j=1}^{G})}$$

Here $r_i$ is the scalar reward of completion $o_i$, not the probability ratio $r_{i,t}(\theta)$. The same advantage value is broadcast to every token position in completion $i$, so there is no token-level credit assignment within a single completion.

Why subtract the group mean? Any constant baseline that doesn't depend on the action can be subtracted without biasing the gradient. The group mean eliminates variance common to all outputs for this prompt.

Why divide by group std? Makes the advantage scale consistent across prompts and reward models — necessary for stable updates when reward distributions vary.
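The normalisation and the per-token broadcast can be sketched in a few lines of standard-library Python. Two choices here are assumptions rather than part of the formula: the population standard deviation (`statistics.pstdev`) and a small `eps` added to the denominator so that a group with identical rewards yields zeros instead of a division by zero. The function names are hypothetical.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalise one group's scalar rewards to zero mean and (roughly) unit std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def broadcast_to_tokens(advantages, completion_lengths):
    """Repeat each completion's advantage once per token position."""
    return [[a] * n for a, n in zip(advantages, completion_lengths)]


# Four completions for one prompt with made-up binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
per_token = broadcast_to_tokens(adv, completion_lengths=[3, 5, 2, 4])
flat = group_advantages([0.5, 0.5, 0.5, 0.5])  # no signal: every advantage is 0
```

When every completion in the group receives the same reward, all advantages are zero and the prompt contributes no policy gradient.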


$r_{i,t}(\theta)$: probability ratio (GRPO)

$$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$$

Identical in form to the PPO ratio. When $r > 1$ the new policy assigns more probability to this token than the old policy did. The same PPO clipped surrogate is applied.


$\frac{1}{|o_i|}$: per-token normalisation

Divides by output length to prevent the objective from being dominated by long completions. Each completion contributes equally regardless of token count.
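The effect of the $\frac{1}{|o_i|}$ factor is easiest to see by comparing it with a flat average over all tokens. In this sketch the inputs are assumed to be per-token clipped-surrogate values that have already been computed; the function names and numbers are illustrative.

```python
def grpo_surrogate(per_token_terms):
    """Mean over tokens within each completion, then mean over the group."""
    per_completion = [sum(terms) / len(terms) for terms in per_token_terms]
    return sum(per_completion) / len(per_completion)


def flat_token_mean(per_token_terms):
    """Contrast case: a single mean over every token in the group."""
    all_terms = [x for terms in per_token_terms for x in terms]
    return sum(all_terms) / len(all_terms)


# A short completion with positive terms and a long one with negative terms.
group = [[1.0, 1.0], [-1.0] * 8]
length_normalised = grpo_surrogate(group)
token_weighted = flat_token_mean(group)
```

With length normalisation the two completions cancel exactly; with a flat token mean the eight-token completion outweighs the two-token one.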


$\beta\, \mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]$: KL penalty

$$\mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] = \mathbb{E}_{o \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(o \mid q)}{\pi_{\text{ref}}(o \mid q)}\right]$$

Penalises the policy for drifting too far from the reference model $\pi_{\text{ref}}$ (usually the SFT checkpoint). Without it, the policy finds degenerate outputs that exploit the reward model (reward hacking). See KLPenaltyTradeoff.

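Reverse KL ($\pi_\theta \,\|\, \pi_{\text{ref}}$): because the expectation is taken under $\pi_\theta$, this direction is mode-seeking (zero-forcing), not mode-covering. It penalises $\pi_\theta$ heavily for putting mass where $\pi_{\text{ref}}$ has little or none, which keeps the policy inside the reference distribution's support but does not require it to cover every mode of $\pi_{\text{ref}}$.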

$\beta$ trade-off: large $\beta$ keeps the policy close to the reference (safe, less reward); small $\beta$ allows more deviation (higher reward potential, more hacking risk). See KLDivergenceExplorer.
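The behaviour of the penalty can be illustrated with an exact KL over a toy three-token vocabulary. This is a sketch under simplifying assumptions: real implementations estimate the term per token from sampled log-probs, and the distributions, the helper name `kl`, and `beta = 0.05` are made up for illustration.

```python
import math


def kl(p, q):
    """Exact KL(p || q) in nats for categorical distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


ref = [0.49, 0.49, 0.02]         # reference model: two likely tokens, one rare
concentrated = [0.98, 0.01, 0.01]  # policy concentrated on one likely reference token
off_support = [0.10, 0.10, 0.80]   # policy moved onto the token the reference finds rare

beta = 0.05  # assumed penalty strength
penalty_concentrated = beta * kl(concentrated, ref)
penalty_off_support = beta * kl(off_support, ref)
```

Moving mass onto the token the reference model considers rare costs far more than concentrating on a token it already favours, and the penalty vanishes when the policy equals the reference.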


GRPO hyperparameters

| Symbol | Typical value | Role |
|---|---|---|
| $G$ | 4–16 | Group size; larger groups give a lower-variance baseline at more compute |
| $\varepsilon$ | 0.1–0.2 | PPO clip ratio |
| $\beta$ | 0.01–0.1 | KL penalty strength |

PPO vs GRPO at a glance

| | PPO | GRPO |
|---|---|---|
| Advantage source | Learned value function + GAE | Group mean/std of scalar rewards |
| Auxiliary loss | Value function loss $\mathcal{L}_{\text{VF}}$ + entropy $\mathcal{H}$ | KL penalty to $\pi_{\text{ref}}$ |
| Extra networks | Critic (value head) | Frozen reference model |
| Credit assignment | Per-token via GAE bootstrapping | Per-completion, broadcast to all tokens |
| Main failure mode | Value model collapse at long horizons | High variance when $G$ is small |

Cite this work


Roy, Swastik. (2025). Cheatsheet: RL Loss Functions. S. Roy. https://swastikroy.me/blog/cheatsheet-rl-losses
