Blog Post · 4 min read
Why Dropout Disappeared from Large Language Models
BERT used dropout everywhere. LLaMA uses none. The reason isn't that regularization stopped mattering — it's that at trillion-token scale, data diversity IS the regularizer.
Padding wastes GPU compute. Sequence packing eliminates it — but introduces cross-document attention contamination unless you explicitly mask it. Here's what the attention mask actually looks like.
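To make that mask concrete, here is a minimal sketch in plain Python. It assumes decoder-style (causal) attention and that the packed sequence is described only by a list of document lengths. The names `packed_causal_mask` and `render` are hypothetical, chosen for illustration, and are not taken from any particular library.

```python
def packed_causal_mask(doc_lengths):
    """Build a mask where mask[i][j] is True if token i may attend to token j.

    Assumption: causal attention, with documents packed back to back
    in the order given by doc_lengths.
    """
    # Label every token position with the index of the document it came from.
    doc_ids = [doc for doc, n in enumerate(doc_lengths) for _ in range(n)]
    total = len(doc_ids)
    # Allow attention only to the same or earlier positions in the same document.
    return [
        [doc_ids[i] == doc_ids[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


def render(mask):
    """Draw the mask as text: '1' = attention allowed, '.' = masked out."""
    return "\n".join("".join("1" if ok else "." for ok in row) for row in mask)


# Two documents of 3 and 2 tokens packed into one sequence of 5.
mask = packed_causal_mask([3, 2])
print(render(mask))
```

For the two-document example this prints a block-diagonal pattern: the first three rows form a lower triangle over columns 0 to 2, and the last two rows form a lower triangle over columns 3 and 4. The empty lower-left block is the part that matters: without it, tokens of the second document would attend to the first document, which is the cross-document contamination described above.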