
Blog Post·2024-06-19·4 min read

Normalization in Transformers: Why Pre-LN Became the Default

The original Transformer used Post-LN, but nearly every large model since GPT-2 uses Pre-LN. The reason comes down to gradient flow, and the math is clean enough to be worth understanding.
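
To make the two placements concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not code from any particular model: `layer_norm` omits the learnable scale and shift, and `sublayer` is a hypothetical stand-in for an attention or feed-forward block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) axis; learnable scale/shift omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN: normalize AFTER the residual add, so the skip path
    # itself passes through LayerNorm at every layer.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize only the sublayer input, so the skip path
    # carries x forward unchanged.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(8, 8))

def sublayer(h):
    # Hypothetical stand-in for attention or an MLP: one linear map.
    return h @ W

x = rng.normal(size=(2, 8))
post = post_ln_block(x, sublayer)
pre = pre_ln_block(x, sublayer)

# With a sublayer that outputs zeros, Pre-LN is exactly the identity,
# while Post-LN still rescales its input.
zero = lambda h: np.zeros_like(h)
pre_identity = pre_ln_block(x, zero)
post_rescaled = post_ln_block(x, zero)
```

The zero-sublayer case shows the difference: in Pre-LN the residual branch is a pure identity path from input to output, which is the property that makes gradients easier to propagate through deep stacks.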

transformers normalization llm-training