GRPO vs PPO: Why Removing the Value Head Changes Everything
GRPO achieves competitive alignment results without a value function. Here's exactly what changes in the math and implementation, and why that matters for training efficiency and stability.
ppo-loss-per-token covered the clipped surrogate objective. This post covers what surrounds it: how the value function is trained, where the advantage estimates come from, and why the entropy bonus exists.
Most explanations of PPO stay at the algorithm level. This post goes one level deeper: how the surrogate loss is actually computed token by token for a language model response.
Single-turn RL teaches a model to produce good responses. Agentic RL teaches it to complete multi-step tasks in an environment — with delayed rewards, partial observability, and real consequences.
RLHF is three steps: supervised fine-tuning, reward model training, and policy optimization. Each step has a specific failure mode. Here's the full picture.
The policy gradient theorem lets you differentiate through a reward signal you can't backprop through. Here's the derivation and why it works.
The reward signal determines what the model learns to do. Swap the reward, swap the capability. Here's how RL elicits reasoning, code generation, math, and tool use.
Supervised fine-tuning teaches a model to imitate. Reinforcement learning teaches it to optimize. The difference turns out to matter enormously.