Tag: llm-training

Blog Post·2024-06-19·7 min read

GRPO vs PPO: Why Removing the Value Head Changes Everything

GRPO achieves competitive alignment results without a value function. Here's exactly what changes in the math and implementation, and why that matters for training efficiency and stability.

rl grpo ppo llm-training alignment

Blog Post·2024-06-19·8 min read

The LLM Alignment Pipeline: SFT, Reward Models, and RL End to End

Training a helpful, harmless, honest LLM requires three sequential stages that each build on the previous one. Here's how SFT, reward modeling, and RL fit together as a system — and where each stage can fail.

alignment rlhf sft reward-model llm-training

Blog Post·2024-06-19·18 min read

From Likelihood to Perplexity: One Unified Reference

Probability, likelihood, log-likelihood, NLL, expectation, entropy, cross-entropy, KL divergence, and perplexity are constantly confused for one another. They are not interchangeable, but they are closely related views of the same underlying ideas. This is the definitive deep-dive that names each one precisely and shows exactly how they connect for LLMs.

math probability information-theory perplexity llm-training

Blog Post·2024-06-19·6 min read

Activation Functions: The Nonlinearity That Makes Neural Networks Work

Without nonlinearity, stacking layers collapses to a single matrix multiplication. Activation functions break that linearity — and the choice of which one determines expressivity, gradient flow, and training efficiency.

math activation-functions neural-networks llm-training

Blog Post·2024-06-19·5 min read

Loss Functions: What You Optimize Is What You Get

The loss function is the specification. Everything the model learns is in service of minimizing it. Here's the math behind every major loss used in LLM training and fine-tuning.

math loss-functions llm-training alignment

Blog Post·2024-06-19·6 min read

Optimization for LLMs: Gradient Descent to Adam

Training a neural network is an optimization problem: minimize a loss function over billions of parameters. The journey from vanilla gradient descent to Adam reveals why each step was necessary.

math optimization gradient-descent adam llm-training

Blog Post·2024-06-19·5 min read

Probability for LLMs: Distributions, Entropy, and KL Divergence

A language model is a probability distribution over sequences. Training it means pushing that distribution toward the data distribution. The math of how you measure and minimize that gap is what this post covers.

math probability information-theory llm-training

Blog Post·2024-06-19·7 min read

Putting It Together: The Mathematics of a Training Run

A single training step involves linear algebra, probability, information theory, optimization, and statistical estimation — all at once. Here's how the pieces fit.

math llm-training transformers synthesis

Blog Post·2024-06-19·7 min read

Statistical Foundations: Distributions, Estimation, and Sampling

Language models are probabilistic systems. Understanding the statistical machinery behind maximum likelihood estimation, Bayesian inference, and sampling algorithms clarifies why training and decoding work the way they do.

math statistics sampling bayesian llm-training

Blog Post·2024-06-19·7 min read

The Full PPO Training Step: Value Loss, Entropy, and GAE

The ppo-loss-per-token post covered the clipped surrogate objective. This post covers what surrounds it: how the value function is trained, where the advantage estimates come from, and why the entropy bonus exists.

rl ppo llm-training rlhf

Blog Post·2024-06-19·5 min read

How PPO Computes Loss Over a Language Model Output

Most explanations of PPO stay at the algorithm level. This post goes one level deeper: how the surrogate loss is actually computed token by token for a language model response.

rl llm-training rlhf

Blog Post·2024-06-19·7 min read

RL for Agentic Systems

Single-turn RL teaches a model to produce good responses. Agentic RL teaches it to complete multi-step tasks in an environment — with delayed rewards, partial observability, and real consequences.

rl agents systems llm-training

Blog Post·2024-06-19·7 min read

Learning from Feedback: RLHF, RLAIF, and Beyond

RLHF is three steps: supervised fine-tuning, reward model training, and policy optimization. Each step has a specific failure mode. Here's the full picture.

rl rlhf alignment llm-training

Blog Post·2024-06-19·6 min read

Policy Gradients: The Math Behind RLHF

The policy gradient theorem lets you estimate the gradient of expected reward even when you can't backprop through the reward signal itself. Here's the derivation and why it works.

rl policy-gradient llm-training

Blog Post·2024-06-19·7 min read

RL as a Skill Acquisition Engine

The reward signal determines what the model learns to do. Swap the reward, swap the capability. Here's how RL elicits reasoning, code generation, math, and tool use.

rl reasoning code llm-training

Blog Post·2024-06-19·5 min read

Why Language Models Need Reinforcement Learning

Supervised fine-tuning teaches a model to imitate. Reinforcement learning teaches it to optimize. The difference turns out to matter enormously.

rl llm-training sft

Blog Post·2024-06-19·7 min read

Synthetic Data for Alignment: Curation, Quality Filtering, and Self-Critique

Human annotation doesn't scale to the data volumes modern alignment requires. Synthetic data — generated by LLMs, filtered, and refined — has become the dominant approach. Here's how it's done and where it breaks down.

alignment synthetic-data sft data-curation llm-training

Blog Post·2024-06-19·6 min read

Inside the FFN: MoE, SwiGLU, and the Architectural Details That Scale

The FFN block consumes most of a transformer's parameters. The choices made there — activation function, gating, expert routing — account for much of the quality gap between model families.

transformers moe swiglu architecture llm-training

Blog Post·2024-06-19·4 min read

Data Efficiency in Pretraining: Packing, Batching, and What Gets Wasted

Up to 30% of GPU compute can vanish into padding tokens that contribute nothing to learning. Here's how modern pretraining pipelines eliminate that waste.

transformers llm-training sequence-packing pretraining

Blog Post·2024-06-19·13 min read

Debugging Transformer Training Runs: Reading the Curves

Most training failures leave signatures in the metrics before they fully manifest. Here's how to read loss curves, gradient norms, learning rate schedules, and activation statistics to diagnose what's going wrong.

transformers llm-training debugging training-dynamics

Blog Post·2024-06-19·4 min read

Normalization in Transformers: Why Pre-LN Became the Default

The original transformer used Post-LN. Pre-LN has been the default since GPT-2. The reason comes down to gradient flow, and the math is clean enough to be worth understanding.

transformers normalization llm-training

Blog Post·2024-06-19·4 min read

Optimizers for LLMs: Adam, Weight Decay, and Why Learning Rate Matters More Than You Think

Adam is the default optimizer for language model training, but using it correctly — the right β values, weight decay, learning rate schedule — makes a larger difference than most people expect.

transformers optimizer llm-training adam