How do you measure whether a model is actually good? The answer is a set of metrics — precision, recall, F1, perplexity, calibration, confidence intervals — each measuring something different and failing in a different way.
Human red-teaming finds attacks automated evals miss. Automated evals achieve scale humans can't. Here's how to combine them, and what each can and can't tell you.
Most safety benchmarks can be gamed, suffer from distribution shift, or measure the wrong thing. Here's what separates a rigorous safety evaluation from a checkbox.