Why Dropout Disappeared from Large Language Models
BERT used dropout everywhere. LLaMA uses none. The reason isn't that regularization stopped mattering — it's that at trillion-token scale, data diversity IS the regularizer.
LLM inference has two fundamentally different compute phases. Prefill processes the prompt in parallel and is compute-bound. Decode generates tokens one at a time and is memory-bandwidth-bound. Understanding both determines how you optimize.
A request-level walkthrough of how the KV cache is populated, grown, and read during LLM inference — covering prefill, decode, memory layout, and why decode is memory-bandwidth-bound.
GPT-4, Gemini, LLaMA, Mistral, DeepSeek, Qwen — they all build on the same transformer skeleton. But the architectural choices diverge sharply. Here's a systematic comparison across model families.
How do you measure whether a model is actually good? The answer is a set of metrics — precision, recall, F1, perplexity, calibration, confidence intervals — each measuring something different and failing in a different way.
Human red-teaming finds attacks automated evals miss. Automated evals achieve scale humans can't. Here's how to combine them, and what each can and can't tell you.
Training-time alignment is not enough. Production AI systems need runtime layers that detect, intercept, and respond to harmful inputs and outputs. Here's how to build them.
GPT-2 established the decoder-only transformer as the dominant paradigm. What followed was six years of systematic improvements — in scale, efficiency, alignment, and reasoning. Here's the arc.