LLM Training Stages: Pre-training, Mid-training, SFT, RL, and DPO

What actually happens at each stage of training a large language model — what data, what objective, what the model learns, and why the stages are ordered the way they are.

June 20, 2025 · 13 min read

Modern LLMs are trained in a sequence of distinct stages, each with different data, a different objective function, and a different goal. Treating "fine-tuning" as a monolith — or conflating SFT with RLHF, or assuming DPO is just cheaper RLHF — leads to poor design decisions and misplaced debugging effort. This post maps the full arc: from raw web text to an instruction-following, preference-aligned assistant, stage by stage.

Stage 1: Pre-training

Goal: build a general world model from raw text.

Pre-training is where virtually all of a model's knowledge comes from. The training signal is next-token prediction: given a sequence of tokens $x_1, \ldots, x_{t-1}$, predict $x_t$ by minimizing the cross-entropy loss, written here for a single sequence of length $T$ and summed over all sequences in the training corpus:

$\mathcal{L}_{\text{PT}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$

There are no labels, no human feedback, and no task specification — just the statistical structure of text at scale.
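As a minimal sketch of the loss above, the snippet below sums $-\log P_\theta(x_t \mid x_{<t})$ over a three-token toy sequence. The hand-written probability tables stand in for a real model's softmax output, and the names `next_token_loss` and `toy_probs` are illustrative assumptions, not part of any library.

```python
import math

def next_token_loss(token_ids, predicted_probs):
    """Return -sum_t log P(x_t | x_<t) for one sequence.

    predicted_probs[t] is a dict {token_id: probability} standing in for the
    model's softmax output at step t (a toy substitute for a real model).
    """
    return -sum(
        math.log(predicted_probs[t][token])
        for t, token in enumerate(token_ids)
    )

# Toy sequence over a 3-token vocabulary with hypothetical model outputs.
tokens = [2, 0, 1]
toy_probs = [
    {0: 0.1, 1: 0.1, 2: 0.8},
    {0: 0.5, 1: 0.25, 2: 0.25},
    {0: 0.2, 1: 0.7, 2: 0.1},
]
loss = next_token_loss(tokens, toy_probs)
print(f"sequence loss: {loss:.4f}")
```

The loss shrinks only when the model puts more probability on the token that actually came next, which is the whole training signal at this stage.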

Data: web-scale text assembled from Common Crawl, books, code repositories, scientific papers, and Wikipedia. The quality of this data matters at least as much as its quantity. FineWeb (from Hugging Face), Dolma, and RedPajama are canonical examples of open, deduplicated, quality-filtered pre-training corpora. The filtering pipeline typically includes language identification, quality filtering (rule-based heuristics and perplexity or classifier scores), exact and near-duplicate removal (for example n-gram or MinHash based), and explicit block-listing of known toxic sources.
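The shape of such a pipeline can be sketched in a few lines. This is a deliberately simplified illustration, not how FineWeb or Dolma are actually built: the blocklist entry, the thresholds, and the alphabetic-fraction heuristic are all assumptions, language identification is omitted, and only exact duplicates (after whitespace and case normalization) are removed.

```python
import hashlib

BLOCKLIST = {"spam.example"}  # hypothetical blocked domain

def looks_like_prose(text, min_words=5, min_alpha_frac=0.6):
    """Crude quality heuristic: enough words, mostly alphabetic characters."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_frac

def filter_corpus(docs):
    """docs is a list of (domain, text); returns the texts that survive."""
    seen_hashes = set()
    kept = []
    for domain, text in docs:
        if domain in BLOCKLIST:
            continue  # block-listed source
        if not looks_like_prose(text):
            continue  # fails the quality heuristic
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        kept.append(text)
    return kept

docs = [
    ("news.example", "The committee approved the new budget on Tuesday."),
    ("mirror.example", "The  committee approved the new budget on tuesday."),
    ("spam.example", "Buy cheap watches now, limited offer, click here today."),
    ("forum.example", "$$$ 1234 5678 !!! ??? ### 0000 1111 2222"),
]
kept = filter_corpus(docs)
print(kept)
```

Production pipelines replace each step with something stronger (a language-ID model, learned quality classifiers, near-duplicate detection), but the order of cheap rejections first is a common design choice.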

Scale: frontier models are trained on 1T–15T tokens. The Chinchilla scaling laws give an approximate compute-optimal allocation: for a compute budget $C$ (in FLOPs), the optimal model size $N$ and dataset size $D$ satisfy

$N \propto \sqrt{C}, \quad D \propto \sqrt{C}$

so quadrupling compute should roughly double both parameters and data (doubling compute scales each by about $\sqrt{2} \approx 1.4$), rather than putting all the extra compute into parameters. In practice, inference cost pushes post-Chinchilla models toward training smaller models on more data than is strictly compute-optimal: Llama 2 7B on 2T tokens is the canonical example.
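A rough worked example makes the square-root scaling concrete. The sketch assumes two widely used rules of thumb that are not stated in this post: training compute $C \approx 6ND$ FLOPs, and a compute-optimal ratio of roughly 20 tokens per parameter. Both constants are approximations, so the outputs are order-of-magnitude estimates only.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumed compute-optimal ratio D / N

def compute_optimal(c_flops):
    """Return (params N, tokens D) satisfying C = 6*N*D and D = 20*N."""
    n = math.sqrt(c_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    d = TOKENS_PER_PARAM * n
    return n, d

n1, d1 = compute_optimal(1e23)
n4, d4 = compute_optimal(4e23)
print(f"C=1e23: N≈{n1:.2e} params, D≈{d1:.2e} tokens")
print(f"C=4e23: N≈{n4:.2e} params, D≈{d4:.2e} tokens")

# Llama 2 7B on 2T tokens, for comparison with the assumed ratio of 20.
llama2_tokens_per_param = 2e12 / 7e9
print(f"Llama 2 7B tokens per parameter: {llama2_tokens_per_param:.0f}")
```

Quadrupling the budget doubles both $N$ and $D$, and the Llama 2 7B ratio sits far above the assumed compute-optimal one, which is the deliberate trade of extra training compute for cheaper inference.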

What the model learns: syntax, factual associations, reasoning patterns, code, math — everything that emerges from statistical co-occurrence in text. The model is an extraordinarily capable text completer. It has no concept of a "turn," no concept of a "user," and no preference for being helpful or harmless. Given the prompt "How do I make a bomb?", it will continue in the most statistically likely direction, which may well be an instructional one.

Duration: weeks to months on thousands of GPUs. Pre-training is by far the most compute-intensive stage.

Output: a base model. GPT-3 (Brown et al., 2020), Llama 2-base, and Mistral-7B-base are examples. You can prompt-engineer them with few-shot examples, but they don't robustly follow instructions and have no safety behaviors.

Stage 2: Mid-training (Continued Pre-training / Domain Adaptation)

Goal: shift the model's knowledge distribution toward a target domain or capability, without forgetting general knowledge.

Mid-training is not universally used — many pipelines skip directly from pre-training to SFT — but it is common in several scenarios:

Domain specialization: medical, legal, coding (Code Llama), or mathematical reasoning. Training continues on domain-specific text at a much smaller scale than pre-training.
Long-context extension: many models are pre-trained at 4K context and then continued at 32K or 128K with positional encoding adjustments (RoPE scaling, YaRN). Positions beyond the pre-training window are out of distribution for the attention layers, so a separate training phase on long sequences is warranted.
Language adaptation: continuing on a non-English corpus to improve multilingual capability without training a new base model from scratch.
Knowledge refreshing: continuing on more recent web data when the pre-training corpus has a hard cutoff.
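For the long-context case, the simplest form of RoPE scaling is linear position interpolation: positions are divided by the extension factor so that the longest target position maps back into the range seen during pre-training. The sketch below shows only that angle computation, under assumed values (head dimension 64, base 10000, 4K to 32K). YaRN uses a more refined, frequency-dependent scheme that is not shown here.

```python
def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one position under rotary position embeddings.

    scale > 1 applies linear position interpolation: the position is
    compressed by `scale` before the per-frequency angles are computed.
    """
    effective_position = position / scale
    return [
        effective_position * base ** (-2.0 * i / head_dim)
        for i in range(head_dim // 2)
    ]

TRAIN_CONTEXT = 4096     # assumed pre-training context length
TARGET_CONTEXT = 32768   # assumed extended context length
scale = TARGET_CONTEXT / TRAIN_CONTEXT

unscaled = rope_angles(TARGET_CONTEXT - 1, head_dim=64)
interpolated = rope_angles(TARGET_CONTEXT - 1, head_dim=64, scale=scale)

# The highest-frequency angle equals the effective position itself.
print(f"unscaled last position:     {unscaled[0]:.1f}")
print(f"interpolated last position: {interpolated[0]:.3f}")
```

Interpolation keeps every angle inside the range the model has already seen, but neighbouring positions are now closer together than during pre-training, which is why continued training at the new length is still needed.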

Objective: same as pre-training — next-token prediction on the new data — but with the domain corpus mixed with a fraction of general-domain data to prevent catastrophic forgetting. The mixing ratio is a hyperparameter that requires careful tuning; too little general data and the model forgets broadly; too much and the domain signal is diluted.
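The mixing ratio can be pictured as a per-sample coin flip between two pools. The sketch below is a toy sampler under that assumption; the function name `mixed_stream`, the document labels, and the 25% general-data fraction are hypothetical, and real pipelines usually mix at the token or shard level rather than per document.

```python
import random

def mixed_stream(domain_docs, general_docs, general_frac, n_samples, seed=0):
    """Sample documents, drawing from general_docs with prob. general_frac."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        pool = general_docs if rng.random() < general_frac else domain_docs
        samples.append(rng.choice(pool))
    return samples

domain = ["code_doc_1", "code_doc_2", "code_doc_3"]
general = ["web_doc_1", "web_doc_2"]

stream = mixed_stream(domain, general, general_frac=0.25, n_samples=10_000)
observed = sum(doc.startswith("web") for doc in stream) / len(stream)
print(f"observed general fraction: {observed:.3f}")
```

Treating `general_frac` as an explicit argument makes it easy to sweep, which matters because the right value depends on how far the domain corpus sits from the pre-training distribution.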

Examples in the literature: Code Llama continued Llama 2 (Touvron et al., 2023) on 500B tokens from a code-focused corpus. Phi-3.5 used "textbook quality" synthetic data curated for reasoning density. Med-PaLM is often mentioned alongside these, although its published adaptation of PaLM relied mainly on instruction prompt tuning rather than continued pre-training on medical literature.

Key risk: catastrophic forgetting. If the domain corpus is much smaller than the general corpus and the mixing ratio is wrong, the model can lose general reasoning ability rapidly. Regularization approaches (EWC, replay buffers of general data) are often used but add complexity.

Stage 3: Supervised Fine-Tuning (SFT)

Goal: teach the model to follow instructions, adopt a conversational format, and produce the output style users expect.

SFT is the first stage that involves human-authored (or human-curated) demonstrations. The training data is a set of (prompt, ideal response) pairs. The objective is next-token prediction on the response tokens, with the loss masked on the prompt:

$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \text{response}} \log P_\theta(x_t \mid x_{<t})$

This is mechanically identical to pre-training, but applied to a much smaller, much higher-quality dataset, with the loss computed only over the response tokens of each example.
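The masking is the only mechanical difference from pre-training, and it fits in one line. In this sketch the per-token log-probabilities are made-up numbers standing in for model outputs, and `loss_mask` marks response positions with 1 and prompt positions with 0.

```python
def sft_loss(token_logprobs, loss_mask):
    """Negative log-likelihood summed over positions where loss_mask is 1."""
    return -sum(lp * m for lp, m in zip(token_logprobs, loss_mask))

# Toy example: 3 prompt tokens followed by 2 response tokens.
token_logprobs = [-2.0, -1.5, -3.0, -0.5, -0.25]  # hypothetical log P(x_t | x_<t)
loss_mask = [0, 0, 0, 1, 1]

masked = sft_loss(token_logprobs, loss_mask)
unmasked = sft_loss(token_logprobs, [1] * len(token_logprobs))
print(f"SFT loss (response only): {masked}")
print(f"loss over all tokens:     {unmasked}")
```

Without the mask, most of the gradient in this example would go toward predicting the prompt, which the model is never asked to generate.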

What changes — and what doesn't: SFT is fundamentally a format transfer, not a knowledge transfer. The model learns what a "turn" looks like, how to conclude a response, how to follow the instruction's implicit style constraints. Factual knowledge doesn't meaningfully improve during SFT — the pre-training corpus is orders of magnitude larger, so SFT-scale data cannot shift factual associations measurably. This is why hallucinations cannot be fixed by SFT alone.

Data quality vs. quantity: LIMA (Zhou et al., 2023) showed that 1,000 carefully curated demonstrations — selected for diversity of task type, clarity of prompt, and quality of response — were competitive with models trained on 50K examples from noisier datasets. The implication is strong: for SFT, curation and quality dominate volume.

Data sources: Human-written demonstrations (expensive, slow), model-generated demonstrations filtered by quality (Alpaca, WizardLM, OpenHermes), and combinations. The risk with model-generated data is distributional collapse — if the teacher model has systematic errors, the student inherits them.

Limitations: SFT can only teach the model to produce the kinds of responses present in the demonstration data. It cannot teach the model to avoid behaviors that were never demonstrated; it cannot teach preference — i.e., that response A is better than response B when both are valid formats. Those require a feedback signal from humans or a trained reward model.

Output: an instruction-following model. The SFT-only checkpoint that precedes preference optimization in a pipeline like the one behind Llama-3-8B-Instruct is a good example. It will follow instructions robustly, but may be sycophantic, may fail on ambiguous preference questions, and may not have internalized safety behaviors consistently.

Stage 4a: RLHF / PPO / GRPO

Goal: align the model's outputs with human preferences — make it helpful, harmless, and honest in ways that are not fully captured by demonstrations alone.

RLHF (Reinforcement Learning from Human Feedback), as introduced in InstructGPT (Ouyang et al., 2022), is a two-phase process.

Phase 1 — Reward model training: collect comparison data: for the same prompt, show a human two model responses and ask which is better. Fit a reward model $r_\phi$ that predicts a scalar reward from a (prompt, response) pair, trained to rank the preferred response above the rejected one under the Bradley-Terry model:

$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right]$

where $y_w$ is the preferred (winning) response and $y_l$ is the rejected (losing) one. The reward model is typically initialized from the SFT model with a scalar head replacing the language model head.
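A small numeric sketch of the reward model loss, using scalar rewards in place of a real network. The reward values are invented for illustration; `log_sigmoid` is written out by hand in a numerically stable form so the block has no dependencies.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(reward_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) pairs."""
    return -sum(log_sigmoid(rw - rl) for rw, rl in reward_pairs) / len(reward_pairs)

# Hypothetical rewards for (preferred, rejected) responses.
well_ranked = [(2.0, -1.0), (1.5, 0.0)]
mis_ranked = [(-1.0, 2.0), (0.0, 1.5)]
undecided = [(0.0, 0.0)]

print(f"well ranked: {reward_model_loss(well_ranked):.4f}")
print(f"mis-ranked:  {reward_model_loss(mis_ranked):.4f}")
print(f"undecided:   {reward_model_loss(undecided):.4f}")
```

Only the difference between the two rewards enters the loss, so the reward model's absolute scale is not pinned down by this objective.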

Phase 2 — RL fine-tuning: use the reward model as the reward signal and optimize the policy $\pi_\theta$ (initialized from the SFT model $\pi_{\text{SFT}}$ ) via policy gradient, with a KL penalty to prevent the policy from drifting too far from the SFT model:

$\mathcal{J}_{\text{RL}} = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{SFT}}(y|x)} \right]$

This objective is maximized, unlike the losses above, which are minimized. PPO (Proximal Policy Optimization) is the canonical algorithm here, using a clipped surrogate objective and a separate value head to estimate per-token advantages.
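Two pieces of this phase are easy to isolate: the KL-penalized reward for a sampled response, and PPO's clipped surrogate term for a single action. The sketch below uses sequence-level log-probabilities and scalar advantages as stand-ins. The values of `beta` and `eps` are common illustrative choices, not settings taken from InstructGPT, and a real implementation works per token with a learned value function.

```python
import math

def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.1):
    """Reward model score minus beta * log(pi_theta(y|x) / pi_SFT(y|x))."""
    return rm_score - beta * (logp_policy - logp_sft)

def ppo_clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# The policy assigns the response higher log-probability than the SFT model,
# so part of the reward model score is paid back as a KL penalty.
shaped = kl_shaped_reward(rm_score=1.0, logp_policy=-10.0, logp_sft=-12.0)

# Probability ratio of 1.5 between the new and old policy.
gain = ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0)
penalty = ppo_clipped_term(math.log(1.5), 0.0, advantage=-1.0)
print(shaped, gain, penalty)
```

The clip caps how much credit a single update can take for a positive advantage, while leaving the pessimistic side unclipped, which is what keeps each policy step small.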

GRPO (Group Relative Policy Optimization), introduced in the DeepSeekMath work and used to train DeepSeek-R1 (DeepSeek-AI, 2025), removes the separate value head. Instead of learning a value function, it samples a group of responses for the same prompt and computes each response's advantage relative to the group's mean reward, usually normalized by the group's standard deviation. This makes training significantly cheaper and sidesteps the failure modes of a poorly fit value model, which are most pronounced for PPO on long responses.
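The group-relative advantage needs nothing beyond the rewards of the sampled group. A minimal sketch, assuming the commonly described form in which rewards are centered on the group mean and divided by the group standard deviation; the small `eps` is an assumption added here to avoid division by zero when every response gets the same reward.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses sampled for one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every response scores the same, the group carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

The second case is worth noticing in practice: prompts that are always solved or never solved contribute zero advantage, so the mix of prompt difficulty affects how much signal each batch provides.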

What changes: the model learns to produce outputs that score well under the reward model, which is trained to reflect human preferences. This can teach behaviors not present in the SFT demonstrations, including refusals, caveats, and calibrated uncertainty. The RL stage is also where reasoning models learn to produce long chains of thought — the reward comes from final answer correctness, not from intermediate steps, so the model discovers that longer, more structured reasoning improves the reward.

Key risks: reward hacking (the policy finds behaviors that maximize the reward model's score without actually being good), mode collapse (the policy collapses onto a few high-reward response patterns), and training instability. Running RL in a loop is expensive — each gradient update requires generating complete responses from the current policy before computing rewards.

Stage 4b: DPO (Direct Preference Optimization)

Goal: same alignment objective as RLHF, but without a separate reward model or RL training loop.

DPO (Rafailov et al., 2023) reformulates the RLHF objective directly as a supervised loss on preference pairs. The key insight is that the optimal policy $\pi^*$ of the KL-regularized objective has a closed form in terms of the reward function $r^*$ and the reference policy $\pi_{\text{ref}}$. Inverting that relationship expresses the reward in terms of the policy, where $Z(x)$ is a normalizing partition function that depends only on the prompt:

$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$

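Substituting this implicit reward into the Bradley-Terry loss eliminates $r^*$ entirely, and $Z(x)$ cancels because the loss depends only on the difference between two rewards for the same prompt. The result is a loss that depends only on $\pi_\theta$ and $\pi_{\text{ref}}$: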
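The cancellation is worth writing out once. Taking the expression for $r^*$ above at the two responses $y_w$ and $y_l$ to the same prompt and subtracting, the $\beta \log Z(x)$ terms are identical and drop out:

$r^*(x, y_w) - r^*(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$

The Bradley-Terry model only ever uses this difference, through $P(y_w \succ y_l \mid x) = \sigma\!\left( r^*(x, y_w) - r^*(x, y_l) \right)$, so the intractable partition function never has to be computed.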

$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$

The policy is trained to increase the log-likelihood of the preferred response and decrease the log-likelihood of the rejected response, relative to the reference policy. No RL sampling loop, no reward model, no value head.
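The DPO loss for a single preference pair can be sketched directly from sequence log-probabilities. The numbers below are invented, `beta=0.1` is an illustrative choice, and a real implementation would obtain the four log-probabilities by summing token log-probabilities from the policy and the frozen reference model.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(log-ratio of y_w) - (log-ratio of y_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# At initialization the policy equals the reference, so the margin is zero.
at_init = dpo_loss(-20.0, -22.0, ref_logp_w=-20.0, ref_logp_l=-22.0)

# After some training: y_w became more likely, y_l less likely.
after = dpo_loss(-18.0, -25.0, ref_logp_w=-20.0, ref_logp_l=-22.0)
print(f"loss at init: {at_init:.4f}, after update: {after:.4f}")
```

Because only log-ratios against the reference appear, the loss starts at $\log 2$ regardless of how likely either response was under the SFT model.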

Advantages: DPO is more stable than PPO, requires no reward model infrastructure, uses the same (prompt, $y_w$, $y_l$) data format as RLHF, and is straightforward to implement on top of standard language model training. It is the default alignment stage for many open-weight models.

Disadvantages: DPO is offline. The preference pairs are fixed at collection time and drawn from some behavioral distribution, not from the current policy, so as the policy moves away from the reference, the training distribution becomes stale. This hurts on tasks requiring complex, multi-step reasoning, where the quality of the current policy's outputs matters for the quality of the preference signal. DPO can also overfit the fixed preference dataset and struggle to generalize to out-of-distribution prompts.

Variants in active use:

IPO (Identity Preference Optimization): replaces the log-sigmoid with a squared loss to avoid saturation.
KTO: replaces preference pairs with scalar labels (thumbs up / thumbs down per response, rather than comparisons), removing the dependency on paired data.
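SimPO: removes the reference model entirely by using the length-normalized (average per-token) log-probability of a response as the implicit reward, together with a target margin between preferred and rejected responses.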
ORPO (Odds Ratio Preference Optimization): combines the SFT cross-entropy loss and the preference alignment loss in a single training pass, eliminating the need for a separate SFT stage.
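To make one of these variants concrete, here is a sketch of a SimPO-style loss as it is commonly described: the implicit reward is the average per-token log-probability scaled by `beta`, and the preferred response must beat the rejected one by a margin `gamma`. The hyperparameter values and log-probabilities are illustrative assumptions, not recommended settings.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def simpo_reward(seq_logp, num_tokens, beta=2.0):
    """Length-normalized implicit reward; no reference model involved."""
    return beta * seq_logp / num_tokens

def simpo_loss(logp_w, len_w, logp_l, len_l, beta=2.0, gamma=1.0):
    """-log sigmoid(reward(y_w) - reward(y_l) - gamma)."""
    margin = simpo_reward(logp_w, len_w, beta) - simpo_reward(logp_l, len_l, beta)
    return -log_sigmoid(margin - gamma)

# Same per-token log-probability, different lengths: identical rewards.
short = simpo_reward(seq_logp=-10.0, num_tokens=10)
long_ = simpo_reward(seq_logp=-20.0, num_tokens=20)
loss = simpo_loss(-10.0, 10, -20.0, 20)
print(short, long_, f"{loss:.4f}")
```

Normalizing by length removes the built-in advantage that a raw summed log-probability gives to shorter responses, which is the role the reference model's log-ratio partly plays in DPO.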

Why the Order Matters

The stages are not interchangeable. Several ordering constraints are fundamental:

Pre-training must come first. No downstream stage can inject world knowledge that wasn't acquired in pre-training. SFT on a tiny corpus cannot teach facts; it can only rearrange how already-known facts are expressed. Forgetting this leads to the common mistake of trying to update a model's knowledge via fine-tuning instead of retrieval augmentation.

SFT must precede RL. The RL stage requires sampling rollouts from the current policy and scoring them. If the policy hasn't learned to follow instructions, its rollouts are incoherent continuations, not responses to prompts — there's nothing meaningful to compare or score. The SFT stage is what installs the concept of a "turn" and makes the model's outputs comparable.

DPO requires a reference policy. By construction, DPO's loss is defined relative to $\pi_{\text{ref}}$, which is typically the SFT model. That model must exist and stay frozen before DPO training begins.

SFT after RL can erase alignment. Fine-tuning an RLHF-aligned model on new demonstration data, even benign data, can partially reverse the alignment, because the SFT loss updates all tokens in the response equally and can overwrite the subtle distributional shifts that RL introduced. This effect has been observed empirically: fine-tuning aligned models on new data, even without harmful intent, can measurably degrade their safety behavior.

Stage Comparison

Stage	Data	Objective	What changes	When to skip
Pre-training	Web-scale text	Next-token prediction (all tokens)	World model, knowledge, language	Never
Mid-training	Domain text	Next-token prediction	Domain knowledge, context length	If no domain focus needed
SFT	Demonstrations (prompt + response)	Next-token (response tokens only)	Format, style, instruction following	Rarely
RLHF/PPO/GRPO	Reward signal (RM or verifiable)	Policy gradient	Preference alignment, reasoning depth	If DPO suffices
DPO	Preference pairs $(x, y_w, y_l)$	Bradley-Terry cross-entropy	Preference alignment	If online RL is required

Emerging Directions

RLVR (Reinforcement Learning from Verifiable Rewards): instead of training a reward model from human comparisons, use a programmatic reward — a math answer checker, a code execution harness, or a formal verifier. The reward signal is binary (correct / incorrect) but ground-truth accurate and free from reward model overfitting. DeepSeek-R1 used RLVR for its math and coding capabilities, producing its distinctive long chain-of-thought reasoning style as an emergent consequence of optimizing for final-answer correctness.
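A verifiable reward can be as small as a parser plus an equality check. The sketch below assumes a hypothetical output convention in which the model ends with a line such as `Answer: 42`; that format, the regular expression, and the function names are illustrative and are not DeepSeek-R1's actual reward implementation.

```python
import re

ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)")  # assumed format

def extract_final_answer(completion):
    """Return the last 'Answer: <number>' in the completion, or None."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None

def verifiable_reward(completion, reference):
    """Binary reward: 1.0 if the final numeric answer matches, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # unparseable output earns nothing
    return 1.0 if float(answer) == float(reference) else 0.0

correct = verifiable_reward("6 * 7 = 42, so the result is clear.\nAnswer: 42", "42")
wrong = verifiable_reward("I think it is 41.\nAnswer: 41", "42")
missing = verifiable_reward("The result is probably forty-two.", "42")
print(correct, wrong, missing)
```

Only the final answer is scored, so nothing in the reward dictates how long or how structured the preceding reasoning should be. That is the property the chain-of-thought behavior emerges from.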

Constitutional AI / RLAIF: use an LLM as the judge rather than human annotators. A "constitution" (a set of principles) is used to generate AI-preference labels at scale, reducing dependence on expensive human annotation. The critique-revision loop allows iterative self-improvement without human involvement at each step.

Iterated / Online DPO: to address DPO's offline limitation, generate new responses with the current policy checkpoint, collect preference labels on those responses, and retrain. Each iteration uses on-policy data, approximating the online distribution-matching that makes RL effective. Early empirical results suggest iterated DPO closes much of the gap with PPO on reasoning tasks at lower compute cost.

ORPO: by folding SFT and preference alignment into a single loss, ORPO eliminates the need for a separate SFT phase and a reference model checkpoint. The training is simpler, the pipeline is shorter, and results on instruction-following benchmarks are competitive with two-stage SFT+DPO. It is a recent method, but promising for resource-constrained training scenarios.

The arc from base model to instruction-following, preference-aligned assistant is not a single fine-tuning step — it is a carefully ordered pipeline where each stage solves a problem the previous stage left open, and where skipping or reordering stages has predictable failure modes. The next post in this series examines mid-training in depth: when it is worth the compute, how to set the mixing ratio, and what the empirical evidence says about domain-adapted models.

References

Brown, T., et al. (2020). Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165. https://doi.org/10.48550/arXiv.2005.14165
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. https://doi.org/10.48550/arXiv.2203.02155
Zhou, C., et al. (2023). LIMA: Less Is More for Alignment. arXiv preprint arXiv:2305.11206. https://doi.org/10.48550/arXiv.2305.11206
Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint arXiv:2305.18290. https://doi.org/10.48550/arXiv.2305.18290
Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288. https://doi.org/10.48550/arXiv.2307.09288
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. https://doi.org/10.48550/arXiv.2501.12948
