GRPO vs PPO: Why Removing the Value Head Changes Everything
GRPO achieves competitive alignment results without a value function. Here's exactly what changes in the math and implementation, and why that matters for training efficiency and stability.
Training a helpful, harmless, honest LLM requires three sequential stages that each build on the previous one. Here's how SFT, reward modeling, and RL fit together as a system — and where each stage can fail.
The loss function is the specification. Everything the model learns is in service of minimizing it. Here's the math behind every major loss used in LLM training and fine-tuning.
RLHF is three steps: supervised fine-tuning, reward model training, and policy optimization. Each step has a specific failure mode. Here's the full picture.
Human annotation doesn't scale to the data volumes modern alignment requires. Synthetic data — generated by LLMs, filtered, and refined — has become the dominant approach. Here's how it's done and where it breaks down.