Chatbot Arena ranks LLMs with ELO, InstructGPT trains a reward model on pairwise preferences, and chess has rated players for seventy years. All three rest on the same one-line probabilistic model — Bradley–Terry — which turns out to be logistic regression over comparisons.
Training a helpful, harmless, honest LLM requires three sequential stages that each build on the previous one. Here's how SFT, reward modeling, and RL fit together as a system — and where each stage can fail.
ppo-loss-per-token covered the clipped surrogate objective. This post covers what surrounds it: how the value function is trained, where the advantage estimates come from, and why the entropy bonus exists.
Most explanations of PPO stay at the algorithm level. This post goes one level deeper: how the surrogate loss is actually computed token by token for a language model response.
RLHF is three steps: supervised fine-tuning, reward model training, and policy optimization. Each step has a specific failure mode. Here's the full picture.