Red-Teaming vs Automated Evals: Tradeoffs and When to Use Each

Swastik Roy

Blog Post

Red-Teaming vs Automated Evals: Tradeoffs and When to Use Each

Human red-teaming finds attacks automated evals miss. Automated evals achieve scale humans can't. Here's how to combine them, and what each can and can't tell you.

June 19, 2024Views: –5 min readCite

responsible-ai red-teaming evaluation safety llm

Human red-teamers are creative in ways automated adversarial generators structurally are not, and the difference is not a matter of effort that more compute will close. A human brings domain expertise — a chemistry PhD probing for bioweapons content discovers subtleties no prompt generator stumbles onto — along with cultural knowledge of harms specific to a region or language, and the strategic patience to build a multi-turn jailbreak that sets up its payload over a dozen messages. An automated generator trained on known attacks reliably rediscovers known attacks; it maps the documented attack surface thoroughly and the undocumented attack surface not at all. That is the entire case for keeping humans in the loop: they find the categories that do not exist yet.

What automation buys that humans cannot

Scale, and nothing else, but scale changes what is possible. A human red-teamer produces hundreds of attacks in a day; an automated pipeline produces millions, and three jobs depend on that volume. Regression testing — running the full attack suite after every model update to confirm new training did not reopen old holes — is hopeless by hand. Coverage across model versions — testing ten checkpoints overnight to see which one to ship — is hopeless by hand. Continuous monitoring — running safety evals against samples of live traffic — is hopeless by hand. None of these are intellectually hard; they are simply impossible at human throughput, and automation is the only thing that makes them routine.

LLM-as-judge

Using a separate LLM to score outputs for harm is now the default scoring mechanism, and it works well enough to deploy. The judge receives the original prompt, the model's output, and a rubric — a 1–5 harm scale with anchor descriptions for each level — and on well-defined harm categories judge models reach roughly 80–90% agreement with human annotators. The failure modes are specific and worth naming: judges sometimes refuse to engage with the very content they are meant to evaluate, an over-refusal that silently drops items; judges carry their own biases, so one fine-tuned on Western norms misses culturally specific harms; and judges can themselves be jailbroken by content crafted to manipulate the evaluation. A judge is a model, and every weakness of the model under test is a weakness the judge may share.

Calibrating judge agreement

Before trusting a judge you have to measure how much it agrees with humans beyond chance, and raw agreement overstates the case because two annotators will agree on easy items by luck alone. Cohen's κ corrects for that chance agreement, defined as

κ = (P_o − P_e) / (1 − P_e)

where P_o is the observed agreement between judge and human and P_e is the agreement expected by chance given each rater's label frequencies. A κ above 0.6 is generally considered adequate for safety annotation, but the threshold should rise with the stakes: for physical-harm or CSAM categories, demand κ above 0.8 and route every disagreement to a human. The number must be reported per harm category and never as a single aggregate, because a judge that is well-calibrated on bias detection can be badly calibrated on bioweapons content, and an averaged κ would launder that failure into a passing grade.

The pipeline that combines both

An effective safety-evaluation pipeline runs in six stages, and the division of labor between human and machine is deliberate at each one.

Human red-teamers define the attack taxonomy and write seed attacks — 100 to 1000 per category — establishing the categories that automation cannot invent.
Automated generation expands each seed into volume: a fine-tuned attacker LLM produces paraphrases, translated variants, and multi-turn versions.
The full attack set runs against the model under evaluation.
An LLM judge scores every output, and human reviewers validate a stratified sample — a random draw plus every output the judge marked high-confidence-harmful.
Human reviewers resolve the disagreements and feed the resolutions back into the judge's rubric.
For each new model release, automated regression reruns the whole suite, and humans review only the delta: the attacks whose score changed relative to the previous model.

The humans appear at the start, where judgment is irreplaceable, and at the points where the machine's confidence is least trustworthy. Everything in between is automated because nothing in between needs a human.

What a red-team result means, and what it does not

A report that says "we found 50 jailbreaks" tells you nothing about the model's safety on its own, because 50 is a numerator with no denominator. To interpret it you need the attack success rate — 50 successes out of how many attempts? — the coverage of which attack categories were tried, the severity distribution of the successful attacks, and a comparison baseline against the previous model and against competitors. A model with 50 successful attacks out of 10,000 tried is at a 0.5% success rate; a model with 50 out of 200 is at 25%. Those are the same 50 jailbreaks and entirely different models, and any safety claim that quotes the 50 without the denominator is not a measurement — it is a headline.

Red-Teaming vs Automated Evals: Tradeoffs and When to Use Each

What automation buys that humans cannot

LLM-as-judge

Calibrating judge agreement

The pipeline that combines both

What a red-team result means, and what it does not

How to cite this article

Cite this work