Tag: rlhf

Blog Post·2026-06-20·10 min read

The Bradley–Terry Model: From Elo Scores to Reward Models

Chatbot Arena ranks LLMs with Elo ratings, InstructGPT trains a reward model on pairwise preferences, and chess has rated players for seventy years. All three rest on the same one-line probabilistic model, Bradley–Terry, which turns out to be logistic regression over comparisons.

preference-learning rlhf reward-models elo ranking

Blog Post·2024-06-19·8 min read

The LLM Alignment Pipeline: SFT, Reward Models, and RL End to End

Training a helpful, harmless, honest LLM requires three sequential stages that each build on the previous one. Here's how SFT, reward modeling, and RL fit together as a system — and where each stage can fail.

alignment rlhf sft reward-model llm-training

Blog Post·2024-06-19·7 min read

The Full PPO Training Step: Value Loss, Entropy, and GAE

The ppo-loss-per-token post covered the clipped surrogate objective. This post covers what surrounds it: how the value function is trained, where the advantage estimates come from, and why the entropy bonus exists.

rl ppo llm-training rlhf

Blog Post·2024-06-19·5 min read

How PPO Computes Loss Over a Language Model Output

Most explanations of PPO stay at the algorithm level. This post goes one level deeper: how the surrogate loss is actually computed token by token for a language model response.

rl llm-training rlhf

Blog Post·2024-06-19·7 min read

Learning from Feedback: RLHF, RLAIF, and Beyond

RLHF is three steps: supervised fine-tuning, reward model training, and policy optimization. Each step has a specific failure mode. Here's the full picture.

rl rlhf alignment llm-training