Blog Post
What Actually Happens to Padding Tokens During LLM Pretraining
Padding wastes GPU compute. Sequence packing eliminates it — but introduces cross-document attention contamination unless you explicitly mask it. Here's what the attention mask actually looks like.
Take a batch of four sequences with lengths $\ell_1, \ell_2, \ell_3, \ell_4$. To stack them into a single tensor, the framework pads each one out to the longest, $\ell_{\max}$, filling the gaps with a reserved pad token. The real tokens number $\sum_i \ell_i$; the padded batch holds $4\,\ell_{\max}$ positions. Roughly 53% of the positions in this batch are padding, and the GPU runs attention and feed-forward math on every one of them. Scale that ratio across a trillion-token training run and a comparable share of your total training FLOPs goes to producing activations you immediately throw away.
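To make the arithmetic concrete, here is a minimal sketch that computes the padding fraction of a batch padded to its longest sequence. The function name `padding_fraction` and the example lengths are illustrative assumptions, not the figures from the batch described above.

```python
def padding_fraction(lengths):
    """Fraction of positions that are padding when every sequence
    is padded out to the longest one in the batch."""
    if not lengths:
        raise ValueError("need at least one sequence")
    real = sum(lengths)                  # positions holding real tokens
    total = len(lengths) * max(lengths)  # positions in the padded tensor
    return (total - real) / total


# Hypothetical lengths, chosen only for illustration.
example_lengths = [100, 25, 50, 25]
print(f"padding fraction: {padding_fraction(example_lengths):.1%}")
```

The fraction depends only on how far the typical sequence falls short of the longest one, which is why a single long outlier in a batch of short sequences is so expensive.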
What the pad token does and doesn't touch
Two separate masks govern a pad token, and conflating them is the usual source of confusion. The loss mask zeros out pad positions so they contribute no gradient — the model is never trained to predict a pad token, and a pad token's own (garbage) prediction never enters the loss. That part is unambiguous.
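The loss mask can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: `PAD_ID = 0` is a made-up pad id, `masked_mean_nll` is a hypothetical helper, and the model's output is reduced to the probability it assigned to each correct target. Real frameworks do the same thing on tensors, typically by marking pad targets with an ignore label.

```python
import math

PAD_ID = 0  # assumed pad token id; real tokenizers choose their own


def masked_mean_nll(target_probs, target_ids, pad_id=PAD_ID):
    """Mean negative log-likelihood over non-pad targets only.

    target_probs[t] is the probability the model assigned to the
    correct token target_ids[t].
    """
    loss_mask = [0.0 if tok == pad_id else 1.0 for tok in target_ids]
    kept = sum(loss_mask)
    if kept == 0:
        return 0.0  # a row of pure padding contributes nothing
    total = sum(-math.log(p) * m for p, m in zip(target_probs, loss_mask))
    return total / kept


target_ids = [17, 42, 9, PAD_ID, PAD_ID]
loss_a = masked_mean_nll([0.5, 0.25, 0.5, 0.9, 0.01], target_ids)
loss_b = masked_mean_nll([0.5, 0.25, 0.5, 0.3, 0.7], target_ids)
print(loss_a == loss_b)  # True: predictions at pad positions never matter
```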
The attention mask is where the misconception lives. In many standard implementations padding tokens are not blocked from the attention computation: a pad position still has a query, still attends, and can still be attended to. What saves correctness is the causal mask. Under causal attention a real token at position $i$ can only see positions $j \le i$, and because pad tokens are appended at the end of each sequence, every real token sits before every pad token and therefore never attends to one. The pad tokens attend to real tokens and to each other, produce hidden states, and those states are then discarded by the loss mask. Quality is fine; compute is wasted. The padding problem is an efficiency problem, not a correctness one.
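The argument leans on pads being appended at the end, and that is easy to check. The sketch below builds a plain causal mask and asks whether any real query position is allowed to see a pad key. The helper names and the two toy layouts are assumptions for illustration.

```python
def causal_mask(n):
    """mask[i][j] is True when query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def real_tokens_see_pad(is_pad):
    """True if any real query position may attend to a pad key
    under a plain causal mask."""
    n = len(is_pad)
    mask = causal_mask(n)
    return any(
        mask[i][j] and is_pad[j]
        for i in range(n) if not is_pad[i]
        for j in range(n)
    )


right_padded = [False, False, False, True, True]  # 3 real tokens, then 2 pads
left_padded = [True, True, False, False, False]   # pads placed first

print(real_tokens_see_pad(right_padded))  # False
print(real_tokens_see_pad(left_padded))   # True
```

With right padding the causal mask alone keeps real tokens clean. If the pads come first, it no longer does, and an explicit padding mask is needed.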
Packing: stop padding, start concatenating
Sequence packing removes the waste by abandoning one-sequence-per-row. Instead you concatenate documents end to end into a single stream and cut it at the context length, so a packed row looks like [doc A][doc B][doc C]..., filled almost exactly to max_len with real tokens and essentially zero padding. One detail is mandatory: the position IDs must reset to 0 at each document boundary rather than counting monotonically across the row, because RoPE and other positional schemes encode distance, and you do not want the model to believe the first token of document B is thousands of positions after the start of document A.
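A minimal sketch of the per-position bookkeeping for one packed row, assuming you know the length of each document in it. The helper name `pack_row_metadata` is hypothetical. It produces a document label for every position and position IDs that restart at 0 at each boundary.

```python
def pack_row_metadata(doc_lengths):
    """Per-position document labels and position IDs for one packed row."""
    doc_ids, position_ids = [], []
    for doc_index, length in enumerate(doc_lengths):
        doc_ids.extend([doc_index] * length)
        position_ids.extend(range(length))  # restart at 0 for each document
    return doc_ids, position_ids


doc_ids, position_ids = pack_row_metadata([3, 4])
print(doc_ids)       # [0, 0, 0, 1, 1, 1, 1]
print(position_ids)  # [0, 1, 2, 0, 1, 2, 3]
```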
The contamination that packing introduces
Packing trades the padding problem for a subtler one. The standard causal mask lets token $i$ attend to every earlier position $j \le i$, and in a packed row those earlier positions include the entire previous document.
So the first token of a document B attends across the boundary into all of the preceding document A, treating an unrelated document as its prefix. This is spurious context the model never sees at inference, where a prompt arrives clean with no random preceding document bolted to its front. Train on enough of it and the model learns to condition on cross-document junk, degrading exactly the behavior you care about.
The fix: mask attention to document boundaries
The repair is to forbid attention from crossing into a different document by intersecting the causal condition with a same-document condition, using a doc_id array that labels which document each position belongs to:

$$M_{ij} = (j \le i) \,\land\, (\text{doc\_id}[i] = \text{doc\_id}[j])$$

A token may attend to an earlier position only when that position is causal and lives in the same document. FlashAttention's varlen mode implements this without ever materializing the full mask: you pass cu_seqlens, the cumulative-sequence-lengths array that holds each document's start offset followed by the total token count, and the kernel restricts every query's attention to its own segment.
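To show what the cumulative-lengths array encodes, here is a pure-Python sketch. It is an emulation for illustration, not the kernel or its API: `cu_seqlens_from_lengths` and `allowed_keys` are hypothetical helpers, and the real kernel takes the array as a tensor rather than a list.

```python
from itertools import accumulate


def cu_seqlens_from_lengths(doc_lengths):
    """Cumulative offsets: entry k is where document k starts,
    and the final entry is the total number of tokens."""
    return [0] + list(accumulate(doc_lengths))


def allowed_keys(query_pos, cu_seqlens):
    """Key positions a query may attend to: causal and inside its own segment."""
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        if start <= query_pos < end:
            return list(range(start, query_pos + 1))
    raise IndexError("query position is outside the packed row")


cu = cu_seqlens_from_lengths([3, 4])
print(cu)                   # [0, 3, 7]
print(allowed_keys(3, cu))  # [3]
print(allowed_keys(5, cu))  # [3, 4, 5]
```

Position 3 is the first token of the second document, so it can see only itself. Nothing from the first document leaks in.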
What the mask looks like
Make it concrete with a packed row of two documents, A of length 3 and B of length 4, for 7 positions total. A plain causal mask is a single lower-triangular block over all 7 positions, letting B's rows reach back into A's columns. The document-masked version is block-diagonal: a 3×3 lower-triangular block for A in the top-left, a 4×4 lower-triangular block for B in the bottom-right, and zeros everywhere outside those two blocks, so the two documents are attention-isolated.
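The bottom-left block, the second document's rows against the first document's columns, is all zeros. That block is exactly the cross-document attention the plain causal mask would have allowed, and zeroing it is the entire fix. (The top-right block is zero under either mask, because causality already forbids attending to later positions.)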
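The 7-position mask for documents of length 3 and 4 can be printed with a short sketch. Rows are queries and columns are keys. The function name `document_causal_mask` is an assumption, and the explicit matrix exists only for inspection, since a fused kernel never builds it.

```python
def document_causal_mask(doc_ids):
    """mask[i][j] == 1 when j <= i and both positions share a document."""
    n = len(doc_ids)
    return [
        [int(j <= i and doc_ids[i] == doc_ids[j]) for j in range(n)]
        for i in range(n)
    ]


doc_ids = [0, 0, 0, 1, 1, 1, 1]  # two documents, lengths 3 and 4
for row in document_causal_mask(doc_ids):
    print(" ".join(str(v) for v in row))
# 1 0 0 0 0 0 0
# 1 1 0 0 0 0 0
# 1 1 1 0 0 0 0
# 0 0 0 1 0 0 0
# 0 0 0 1 1 0 0
# 0 0 0 1 1 1 0
# 0 0 0 1 1 1 1
```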
The payoff
For corpora dominated by short documents — instruction-tuning examples, Q&A pairs, chat turns — most sequences are far shorter than the context window, so padding dominates and packing moves effective GPU utilization from around 60% to roughly 98%. You stop paying for thrown-away pad activations, and as long as the attention mask is block-diagonal you pay nothing in quality for it. Packing without the document mask is the trap: you recover the compute and quietly corrupt the model's notion of context.
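A rough way to estimate the gain on your own data is to compare the share of real tokens under both schemes. The sketch below assumes each padded batch is padded to its own longest sequence and that packing cuts one concatenated stream at max_len. The function names and the document lengths are hypothetical, so the printed numbers say nothing about any particular corpus.

```python
import math


def padded_utilization(lengths, batch_size):
    """Real-token share when each batch is padded to its own longest sequence."""
    real = total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        real += sum(batch)
        total += len(batch) * max(batch)
    return real / total


def packed_utilization(lengths, max_len):
    """Real-token share when all tokens are concatenated and cut at max_len;
    only the final row can contain padding."""
    real = sum(lengths)
    rows = math.ceil(real / max_len)
    return real / (rows * max_len)


lengths = [30, 120, 45, 300, 60, 15, 200, 90]  # hypothetical document lengths
print(f"padded: {padded_utilization(lengths, batch_size=4):.0%}")
print(f"packed: {packed_utilization(lengths, max_len=512):.0%}")
```

On a real corpus the packed figure climbs toward 100% as the number of rows grows, because only the last row is left partly empty.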