Tag: evaluation

Blog Post · 2024-06-19 · 8 min read

Evaluation Metrics: Precision, Recall, Calibration, and Confidence

How do you measure whether a model is actually good? The answer is a set of metrics — precision, recall, F1, perplexity, calibration, confidence intervals — each measuring something different and failing in a different way.

math evaluation metrics statistics llm

Blog Post · 2024-06-19 · 5 min read

Red-Teaming vs Automated Evals: Tradeoffs and When to Use Each

Human red-teaming finds attacks automated evals miss. Automated evals achieve scale humans can't. Here's how to combine them, and what each can and can't tell you.

responsible-ai red-teaming evaluation safety llm

Blog Post · 2024-06-19 · 6 min read

Designing Safety Benchmarks for LLMs: What Makes an Eval Good

Most safety benchmarks are gameable, suffer from distribution shift, or measure the wrong thing. Here's what separates a rigorous safety evaluation from a checkbox.

responsible-ai safety evaluation benchmarking