Tag: rl

Blog Post·2024-06-19·7 min read

GRPO vs PPO: Why Removing the Value Head Changes Everything

GRPO achieves competitive alignment results without a value function. Here's exactly what changes in the math and implementation, and why that matters for training efficiency and stability.

rl grpo ppo llm-training alignment

Blog Post·2024-06-19·7 min read

The Full PPO Training Step: Value Loss, Entropy, and GAE

"How PPO Computes Loss Over a Language Model Output" covered the clipped surrogate objective. This post covers what surrounds it: how the value function is trained, where the advantage estimates come from, and why the entropy bonus exists.

rl ppo llm-training rlhf

Blog Post·2024-06-19·5 min read

How PPO Computes Loss Over a Language Model Output

Most explanations of PPO stay at the algorithm level. This post goes one level deeper: how the surrogate loss is actually computed token by token for a language model response.

rl llm-training rlhf

Blog Post·2024-06-19·7 min read

RL for Agentic Systems

Single-turn RL teaches a model to produce good responses. Agentic RL teaches it to complete multi-step tasks in an environment, with delayed rewards, partial observability, and real consequences.

rl agents systems llm-training

Blog Post·2024-06-19·7 min read

Learning from Feedback: RLHF, RLAIF, and Beyond

RLHF is three steps: supervised fine-tuning, reward model training, and policy optimization. Each step has a specific failure mode. Here's the full picture.

rl rlhf alignment llm-training

Blog Post·2024-06-19·6 min read

Policy Gradients: The Math Behind RLHF

The policy gradient theorem lets you estimate the gradient of expected reward even when the reward signal itself can't be backpropagated through. Here's the derivation and why it works.

rl policy-gradient llm-training

Blog Post·2024-06-19·7 min read

RL as a Skill Acquisition Engine

The reward signal determines what the model learns to do. Swap the reward, swap the capability. Here's how RL elicits reasoning, code generation, math, and tool use.

rl reasoning code llm-training

Blog Post·2024-06-19·5 min read

Why Language Models Need Reinforcement Learning

Supervised fine-tuning teaches a model to imitate. Reinforcement learning teaches it to optimize. The difference turns out to matter enormously.

rl llm-training sft