Synthetic Data for Alignment: Curation, Quality Filtering, and Self-Critique

Swastik Roy

Blog Post

Human annotation doesn't scale to the data volumes modern alignment requires. Synthetic data — generated by LLMs, filtered, and refined — has become the dominant approach. Here's how it's done and where it breaks down.

June 19, 2024 · 7 min read

alignment synthetic-data sft data-curation llm-training

At roughly three minutes per example, a million high-quality instruction-response pairs cost about 50,000 person-hours: twenty-five annotators working full time for a year, or about 1 million dollars at $20 an hour. That buys the data for one training run of one model. Frontier labs run dozens of training runs over billions of tokens of instruction data, and the arithmetic does not close: human annotation cannot produce supervision at the volume modern alignment consumes. Synthetic data can, but only if you treat generation as the easy half and filtering as the half that determines whether any of it is usable. The alignment pipeline is bounded at each stage by the quality of the signal feeding it, and synthetic data is the lever that moves that bound, in both directions.

What "synthetic data" means concretely

For alignment, synthetic data is LLM-generated (prompt, response) pairs used for SFT, or (prompt, chosen, rejected) triples used for reward modeling or DPO. The generator is typically a stronger model — a frontier model used to produce training data for a weaker one — which makes the whole setup an instance of distillation: the weaker model learns to imitate the stronger model's response distribution. That framing is also the source of every failure mode later in this post, because imitating a distribution you cannot reproduce on your own is exactly where the wheels come off.
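To make the two record shapes concrete, here is a minimal sketch in Python. The class and field names (`SFTExample`, `PreferenceExample`, `to_sft`) are illustrative assumptions, not a format any particular training library requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTExample:
    """One (prompt, response) pair for supervised fine-tuning."""
    prompt: str
    response: str


@dataclass(frozen=True)
class PreferenceExample:
    """One (prompt, chosen, rejected) triple for reward modeling or DPO."""
    prompt: str
    chosen: str
    rejected: str


def to_sft(pref: PreferenceExample) -> SFTExample:
    # Illustrative helper: reuse the preferred side of a comparison
    # as an SFT target.
    return SFTExample(prompt=pref.prompt, response=pref.chosen)
```

Keeping the two shapes as separate types makes it harder to feed a preference triple into an SFT loader by accident.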

Prompt diversity is the real bottleneck

The diversity of a synthetic corpus is capped by the diversity of its prompts. Generating responses is the easy part; generating realistic, high-coverage prompts that span what users actually ask is the hard part, and several approaches trade off coverage against effort. Seed-and-expand starts from a small set of human-written prompts and asks the model for variations, which is cheap but clusters the variations around the seed distribution. Topic-guided generation defines a taxonomy — coding, writing, math, factual Q&A, safety-relevant — and generates per category, which guarantees breadth across the taxonomy but misses the long tail no taxonomy enumerates.
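A rough sketch of the two approaches above, assuming a hypothetical `generate` callable that takes an instruction string and returns a list of candidate prompts. The instruction wording and function names are invented for illustration.

```python
from typing import Callable, List

# Example taxonomy taken from the categories named in the text.
TAXONOMY = ["coding", "writing", "math", "factual Q&A", "safety-relevant"]


def topic_guided_requests(taxonomy: List[str], per_topic: int) -> List[str]:
    """Build one generation request per category in the taxonomy."""
    return [
        f"Write {per_topic} distinct user prompts in the category '{topic}'."
        for topic in taxonomy
    ]


def seed_and_expand(
    seeds: List[str],
    generate: Callable[[str], List[str]],  # hypothetical LLM wrapper
    variations: int = 3,
) -> List[str]:
    """Ask the generator for variations of each human-written seed prompt."""
    expanded: List[str] = []
    for seed in seeds:
        instruction = f"Write {variations} variations of this user prompt:\n{seed}"
        # Truncate in case the generator returns more than requested.
        expanded.extend(generate(instruction)[:variations])
    return expanded
```

Both functions inherit the limits described above: the expanded prompts stay near their seeds, and the requests cover only the categories the taxonomy lists.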

Self-Instruct (Wang et al., 2022) is the canonical iterative recipe: start from 175 seed tasks, use the model to generate new task descriptions and demonstrations, and discard near-duplicates by ROUGE overlap against the existing pool before adding what survives. The filter is a hard threshold on surface overlap.

$$\text{ROUGE\text{-}L}(\text{new}, \text{existing}) < 0.7$$

It keeps the corpus from collapsing into restatements of a handful of tasks, but ROUGE-L only measures surface overlap (the longest common subsequence of tokens), so two prompts that are semantically identical while sharing few literal tokens both pass. Persona-driven generation attacks a different axis: generate prompts as if written by a student, an expert, a child, or a non-native speaker. That improves stylistic diversity, which topic taxonomies and ROUGE filters are both blind to.
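The novelty filter can be sketched in a few lines of plain Python. This is a simplified stand-in, not the Self-Instruct implementation: it assumes whitespace tokenization and uses the ROUGE-L F-measure computed from the longest common subsequence, and the function names are made up for this example.

```python
from typing import Iterable, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure over lowercase whitespace tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def admit_novel(candidates: Iterable[str], pool: Iterable[str],
                threshold: float = 0.7) -> List[str]:
    """Add a candidate only if it scores below the threshold against every pool entry."""
    pool = list(pool)
    for cand in candidates:
        if all(rouge_l(cand, existing) < threshold for existing in pool):
            pool.append(cand)
    return pool


pool = admit_novel(
    ["Write a poem about the ocean",             # near-duplicate: rejected
     "Compose verses describing marine waters"], # paraphrase: admitted
    pool=["Write a poem about the sea"],
)
```

The second candidate means the same thing as the seed but shares no tokens with it, so it is admitted. That is the blind spot described above.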

Filtering responses is a pipeline, not a step

Not every generated response belongs in a training set, and the filtering pipeline is layered because each layer catches what the cheaper ones upstream miss. Rule-based filters go first and cheapest: drop responses with hallucinated citations, generator refusals, lengths below a floor, or leaked generator identity ("As Claude, I…"). Reward-model scoring comes next — run each (prompt, response) through an RM and keep what clears a threshold — with the obvious circularity that this needs a good RM, which is part of what you are trying to build; in practice you borrow a pre-existing general-purpose RM or a preference-trained classifier for the filter. LLM-as-judge adds a second model rating helpfulness, correctness, and safety, at the cost of roughly one judge call per generated response, and it catches failure modes the RM's single scalar smooths over. Deduplication closes the pipeline because near-duplicate responses add no training signal and only burn compute — SimHash or embedding-based dedup at the response level, n-gram dedup at the sentence level.
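The layering can be sketched as a single pass that runs the cheap checks first, so the expensive ones see fewer items. Everything named here is an assumption for illustration: `rm_score` and `judge_ok` are hypothetical callables wrapping a reward model and a judge model, the marker lists are toy examples, and exact-match dedup on normalized text stands in for SimHash or embedding-based dedup. The hallucinated-citation check is omitted because it needs an external lookup.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Toy rule lists; a real pipeline would maintain much longer ones.
REFUSAL_MARKERS = ("i can't help with", "i cannot help with", "i'm sorry, but")
IDENTITY_LEAK = re.compile(r"\bas (claude|chatgpt|an ai language model)\b", re.IGNORECASE)


def passes_rules(response: str, min_words: int = 5) -> bool:
    """Cheap rule-based checks: length floor, refusals, leaked generator identity."""
    text = response.lower()
    if len(text.split()) < min_words:
        return False
    if any(marker in text for marker in REFUSAL_MARKERS):
        return False
    return IDENTITY_LEAK.search(response) is None


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation and whitespace for dedup keys."""
    return re.sub(r"\W+", " ", text.lower()).strip()


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    rm_score: Callable[[str, str], float],  # hypothetical reward-model wrapper
    judge_ok: Callable[[str, str], bool],   # hypothetical LLM-judge wrapper
    rm_threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Apply rules, then RM threshold, then judge, then dedup, in that order."""
    seen, kept = set(), []
    for prompt, response in pairs:
        if not passes_rules(response):
            continue
        if rm_score(prompt, response) < rm_threshold:
            continue
        if not judge_ok(prompt, response):
            continue
        key = normalize(response)
        if key in seen:
            continue
        seen.add(key)
        kept.append((prompt, response))
    return kept
```

Because each stage short-circuits with `continue`, the judge is called only on responses that already cleared the rules and the RM threshold.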

Constitutional AI generates preference data from a single response

Constitutional AI (Bai et al., 2022) turns one response into a preference pair without a human. Given an initial response that may be harmful, the model critiques it against a principle ("is this response harmful?") and then revises it based on its own critique, and the revision replaces the original as the SFT target. The byproduct is a labeled comparison for free: the pair (original_response, revised_response) is exactly a (rejected, chosen) pair for RM or DPO training. The critique-revise cycle can iterate, with revisions themselves critiqued and revised, and in practice it converges within two or three rounds. This is the same Constitutional-AI machinery the learning-from-feedback post describes on the labeling side, viewed here as a data generator rather than a labeler.
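A minimal sketch of the critique-revise loop as a pair generator. The `critique` and `revise` callables are hypothetical wrappers around model calls, and the convention that `critique` returns `None` when it finds nothing to fix is an assumption made for this example, not something the Constitutional AI paper specifies.

```python
from typing import Callable, NamedTuple, Optional


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str
    rejected: str


def critique_and_revise(
    prompt: str,
    initial: str,
    critique: Callable[[str, str, str], Optional[str]],  # hypothetical model call
    revise: Callable[[str, str, str], str],              # hypothetical model call
    principle: str,
    rounds: int = 2,
) -> Optional[PreferencePair]:
    """Run up to `rounds` critique-revise cycles and emit (chosen, rejected)."""
    current = initial
    for _ in range(rounds):
        feedback = critique(prompt, current, principle)
        if feedback is None:  # assumed convention: nothing left to fix
            break
        current = revise(prompt, current, feedback)
    if current == initial:
        # No revision happened, so there is no comparison to learn from.
        return None
    return PreferencePair(prompt=prompt, chosen=current, rejected=initial)
```

Returning `None` for unrevised responses keeps degenerate pairs, where chosen and rejected are identical, out of the preference set.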

Rejection sampling is RL without the policy gradient

Rejection-sampling fine-tuning (ReST, RAFT) generates $N$ responses per prompt with the current policy, scores all $N$ with the RM, keeps only those above a threshold $\tau$, and fine-tunes on the survivors. It is a softer, more stable version of RL: you select for high reward without the variance of a policy-gradient update, and the loss is just SFT restricted to the high-reward pairs.

$$\mathcal{L}_\text{ReST}(\theta) = -\sum_{(x, y)\,:\, r(x, y) > \tau} \log \pi_\theta(y \mid x)$$

Because it is ordinary cross-entropy on a curated subset, it has none of the PPO loop's moving parts — no value network, no KL leash, no importance weights — which is exactly why it is stable and also why it is weaker than full RL at squeezing the last reward out.
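A toy sketch of the selection and the loss, with the model abstracted away. `log_prob` is a hypothetical callable returning the policy's sequence log-probability $\log \pi_\theta(y \mid x)$. In a real trainer that term comes from the model's forward pass and the sum is backpropagated.

```python
from typing import Callable, Iterable, List, Tuple


def select_above_threshold(
    scored: Iterable[Tuple[str, str, float]], tau: float
) -> List[Tuple[str, str]]:
    """Keep (prompt, response) pairs whose reward strictly exceeds tau."""
    return [(x, y) for x, y, r in scored if r > tau]


def rest_loss(
    survivors: Iterable[Tuple[str, str]],
    log_prob: Callable[[str, str], float],  # hypothetical: log pi_theta(y | x)
) -> float:
    """Negative log-likelihood summed over the surviving pairs only."""
    return -sum(log_prob(x, y) for x, y in survivors)


# Toy numbers: (prompt, response, reward) plus a lookup table of log-probs.
scored = [("p", "a", 0.9), ("p", "b", 0.2), ("p", "c", 0.7)]
toy_log_probs = {("p", "a"): -2.0, ("p", "b"): -5.0, ("p", "c"): -1.5}

survivors = select_above_threshold(scored, tau=0.5)
loss = rest_loss(survivors, lambda x, y: toy_log_probs[(x, y)])
```

The low-reward response `"b"` never enters the sum. The loss has no negative term for rejected samples, only an absence of positive signal.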

The threshold $\tau$ is the knob with the sharp edges. Set it too high and you keep too few samples, starving the update of signal; set it too low and you fine-tune on mediocre responses and teach the model that mediocre is acceptable. The practical heuristic sidesteps the absolute scale entirely: keep the top-$K$ of the $N$ sampled responses per prompt, where $K = 1$ is greedy best-of-$N$ and $K = 4$ to $8$ keeps diversity while holding quality, so the bar adapts per prompt instead of riding on a global $\tau$ that the RM's calibration would have to justify.
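The per-prompt heuristic can be sketched with the standard library. The function name and the `(prompt, response, reward)` tuple layout are assumptions for illustration. Here `k` is the number of responses kept per prompt.

```python
import heapq
from collections import defaultdict
from typing import Iterable, List, Tuple


def top_k_per_prompt(
    scored: Iterable[Tuple[str, str, float]], k: int
) -> List[Tuple[str, str]]:
    """Keep the k highest-reward responses for each prompt, best first."""
    by_prompt = defaultdict(list)
    for prompt, response, reward in scored:
        by_prompt[prompt].append((reward, response))

    kept = []
    for prompt, candidates in by_prompt.items():
        # Rank by reward only; ties keep their original order.
        for _, response in heapq.nlargest(k, candidates, key=lambda c: c[0]):
            kept.append((prompt, response))
    return kept


scored = [
    ("p1", "a", 0.2), ("p1", "b", 0.9), ("p1", "c", 0.5),
    ("p2", "d", 0.1), ("p2", "e", 0.05),
]
best_two = top_k_per_prompt(scored, k=2)
```

Both responses for `p2` survive even though their rewards are low in absolute terms. That is the per-prompt bar at work: a global threshold of 0.5 would have discarded `p2` entirely.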

Where synthetic data breaks down

The failure modes all trace back to the distillation framing. Generator mode collapse comes first: a strong model writes in its own voice, and fine-tuning a weaker model on that data teaches it the generator's style rather than its own — most visible on creative and stylistic tasks where "voice" is the point. Compounding errors are the reasoning analogue: when the generator makes one mistake in a multi-step chain, the synthetic example bakes that mistake in, and the trained model learns the wrong pattern with full confidence because the data never flags it as wrong. Distribution mismatch is the deepest one — the generator's responses are optimal for the generator's capabilities, so a smaller model can fine-tune on a chain of reasoning it cannot actually execute, learning to mimic the output without acquiring the underlying capability. And evaluation contamination is the one that hides all the others: if the same model generates training data and grades the trained model, the eval is measuring similarity to the generator, not capability, and every number looks better than it is.

The assembled pipeline

Put end to end, a working synthetic-data pipeline for alignment generates diverse prompts (topic-guided plus Self-Instruct expansion plus persona-driven), samples multiple responses per prompt from the strongest available model, and filters them through the rule-based → RM → LLM-judge → dedup stack above. For DPO or RM training it uses Constitutional-AI critique-and-revise to turn single responses into (chosen, rejected) pairs; for SFT it uses rejection sampling to keep the top-$K$ per prompt. Then it iterates: train on this generation of synthetic data, then use the trained model to generate the next generation. That loop converges when the trained model is indistinguishable from the generator on the target task distribution, which is the same thing as saying you have reached the ceiling of what the generator can teach. Past that point the generator has to improve, or a human has to step back in, because no amount of resampling extracts capability the source never had.
