Normalization in Transformers: Why Pre-LN Became the Default
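The original Transformer used Post-LN, which applies layer normalization after each residual addition. Most large models since GPT-2 use Pre-LN, which normalizes the input to each sublayer and leaves the residual path untouched. The reason comes down to gradient flow, and the math is clean enough to be worth understanding.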
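To make the two placements concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: `sublayer` is a hypothetical stand-in for attention or an MLP (a fixed elementwise scaling), and `layer_norm` omits the learned scale and shift. It only shows the structural difference, namely where the skip connection sits relative to the normalization, and does not measure gradients.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x, weight=0.5):
    """Hypothetical stand-in for attention or an MLP: a fixed elementwise scaling."""
    return [weight * v for v in x]

def post_ln_block(x):
    # Post-LN: normalize after the residual addition,
    # so the skip path itself passes through layer norm.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln_block(x):
    # Pre-LN: normalize only the sublayer input;
    # the skip path carries x forward unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

x = [1.0, 2.0, 4.0, 8.0]
scaled = [10.0 * v for v in x]

# Post-LN renormalizes the whole stream, so the input's scale is discarded.
print(post_ln_block(x))
print(post_ln_block(scaled))

# Pre-LN adds a normalized update on top of the untouched input.
print(pre_ln_block(x))
print(pre_ln_block(scaled))
```

In this toy setup the two Post-LN outputs come out nearly identical even though the inputs differ by a factor of ten, while each Pre-LN output is the original input plus a bounded update. That untouched additive path is what gives the Pre-LN Jacobian an identity term at every layer, which is the starting point for the gradient-flow argument.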