Positional Encodings: From Sinusoids to RoPE

Swastik Roy

Self-attention on its own is permutation-equivariant: shuffle the tokens and the outputs shuffle with them, nothing more. Positional encodings break that symmetry. The choice of encoding determines whether your model can generalize to sequences longer than the ones it trained on.

June 19, 2024

transformers rope positional-encoding architecture

Self-attention is a set operation. Strip away everything else and what an attention layer computes is a weighted sum over value vectors, where the weights come from dot products between queries and keys, and a dot product does not care where its operands sat in the sequence. Feed the model "The cat sat" and then "sat cat The," and as long as the token embeddings are the same, each token receives exactly the same output vector; only the slot it lands in changes.

$$\text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Permute the rows of $Q$, $K$, and $V$ identically and the rows and columns of that softmax matrix permute with them, so the output is just a permutation of the unpermuted result: order carries no information on its own. A language model that cannot tell "dog bites man" from "man bites dog" is useless, so something has to inject position before the dot products are taken. That something is the positional encoding, and the choices made about it since 2017 are the difference between a model that breaks at its training length and one that reads a book it has never seen.
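
The symmetry is easy to check numerically. The following is a minimal NumPy sketch, not anyone's production attention: the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins, there is a single head, and no positional signal is added anywhere.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Single-head scaled dot-product attention, no mask, no positions.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                      # toy token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = attention(X @ W_q, X @ W_k, X @ W_v)

# Shuffle the tokens and run the same layer again.
perm = rng.permutation(n)
Xp = X[perm]
out_perm = attention(Xp @ W_q, Xp @ W_k, Xp @ W_v)

# The shuffled run equals the original run with its rows shuffled the same way.
equivariant = np.allclose(out_perm, out[perm])
```

Here `equivariant` comes out true: each token's output vector is unchanged and only its row index moves, which is exactly why the layer cannot see order without help.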

Absolute positions: sinusoids and lookup tables

The original transformer added a fixed sinusoidal signal to each token embedding before the first layer. Each dimension of the encoding is a sinusoid whose wavelength grows geometrically with the dimension index, so position $pos$ maps to a vector of alternating sines and cosines, with $i$ indexing the sine/cosine pairs and $d$ the embedding width.

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Low dimensions oscillate fast and high dimensions oscillate slowly, so the full vector is a kind of binary-clock fingerprint that is unique for every position and smooth between neighbors. The signal is not learned, which means in principle it is defined for any position, including positions beyond anything seen in training. In practice extrapolation degrades quickly, because the model learns to read these features at the scales it actually saw and has no calibration for the rest.
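
As a rough sketch of the formula above, here is one way to build the table in NumPy. The function name `sinusoidal_encoding` and the sizes are illustrative choices, and the code assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) array; d_model is assumed even."""
    pos = np.arange(num_positions)[:, None]        # shape (P, 1)
    i = np.arange(d_model // 2)[None, :]           # shape (1, d/2), pair index
    angles = pos / base ** (2 * i / d_model)       # pos / 10000^(2i/d)
    pe = np.empty((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(num_positions=1024, d_model=64)

# Nothing stops us from evaluating far beyond any training length;
# whether a trained model can read the result is a separate question.
pe_far = sinusoidal_encoding(num_positions=100_000, d_model=64)
```

The encoding is added to the token embeddings, so `pe[:seq_len]` would simply be summed with an embedding matrix of the same shape before the first layer.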

The lazier alternative is to skip the math and learn a position embedding table indexed by slot, exactly as you would learn a token embedding. BERT does this, and it works fine inside the training window. The cost is that the table has a fixed number of rows: a model trained at 512 positions has no embedding for position 513, so it does not extrapolate at all — it simply has nothing to look up.
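
The failure mode is literal, as a toy lookup shows. In this sketch `position_table` is a random stand-in for a trained table and `lookup` is a hypothetical helper; indices are 0-based, so index 512 is the 513th position.

```python
import numpy as np

rng = np.random.default_rng(0)
max_positions, d_model = 512, 64

# Stand-in for a learned table: one row per position slot seen in training.
position_table = rng.normal(scale=0.02, size=(max_positions, d_model))

def lookup(positions):
    """Fetch position embeddings by 0-based slot index."""
    return position_table[np.asarray(positions)]

inside = lookup([0, 1, 511])       # every slot exists, so this works

try:
    lookup([512])                  # the 513th position has no row
    extrapolates = True
except IndexError:
    extrapolates = False
```

Real frameworks fail in their own ways at this point, but the cause is the same: there is no row to read.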

Relative positions: bias the logits, not the embeddings

Both absolute schemes share a conceptual flaw. What a language model usually needs is not "this token is at absolute index 274" but "this token is three words to the left of that one." Shaw et al. acted on that directly by injecting a learned relative-position term into the attention logit itself, so the score between positions $i$ and $j$ depends on a vector keyed by their offset $i-j$.

$$a_{ij} = \frac{(x_i W_Q)\,(x_j W_K + r_{i-j})^\top}{\sqrt{d_k}}$$

The bias $r_{i-j}$ is indexed only by the gap between the two tokens, never by where either sits in the sequence, so the same learned pattern of "look three back" applies whether the pair is at the start of a paragraph or a thousand tokens in. ALiBi strips this idea to its bones: it adds no learned vector at all, just a linear penalty on distance.

$$a_{ij} = \frac{q_i k_j^\top}{\sqrt{d_k}} - m \cdot |i - j|$$

The slope $m$ is fixed per head — some heads get a steep penalty and attend locally, others get a shallow one and see far — and because the penalty is a closed-form function of distance, ALiBi extrapolates to longer sequences essentially for free, which was its original selling point.
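
Both relative schemes fit in a few lines of NumPy. This is a sketch under stated assumptions rather than either paper's reference code: `shaw_logits` clips the offset to a window of `max_offset` so the table stays finite, `alibi_slopes` uses a geometric schedule that assumes a power-of-two head count, and the bias is symmetric in $|i - j|$ with no causal mask.

```python
import numpy as np

def shaw_logits(X, W_q, W_k, rel_table, max_offset):
    """Logits a_ij = q_i . (k_j + r_{i-j}) / sqrt(d_k), with clipped offsets."""
    q, k = X @ W_q, X @ W_k
    n, d_k = q.shape
    pos = np.arange(n)
    # Offsets i - j, clipped to [-max_offset, max_offset] (an assumption here),
    # then shifted by max_offset so they are valid row indices into rel_table.
    offset = np.clip(pos[:, None] - pos[None, :], -max_offset, max_offset)
    r = rel_table[offset + max_offset]                 # shape (n, n, d_k)
    return (q @ k.T + np.einsum("id,ijd->ij", q, r)) / np.sqrt(d_k)

def alibi_slopes(num_heads):
    # Geometric schedule 2^(-8/n), 2^(-16/n), ...; assumes a power-of-two n.
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) array to add to the logits."""
    pos = np.arange(seq_len)
    distance = np.abs(pos[:, None] - pos[None, :])     # |i - j|
    return -alibi_slopes(num_heads)[:, None, None] * distance

rng = np.random.default_rng(0)
n, d, d_k, max_offset = 10, 16, 8, 4
W_q, W_k = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
rel_table = rng.normal(size=(2 * max_offset + 1, d_k))

# Identical tokens everywhere, so any variation in the logits is positional.
X_same = np.tile(rng.normal(size=(1, d)), (n, 1))
logits = shaw_logits(X_same, W_q, W_k, rel_table, max_offset)

# The ALiBi bias for a short sequence is a corner of the bias for a long one.
bias_short = alibi_bias(seq_len=8, num_heads=4)
bias_long = alibi_bias(seq_len=64, num_heads=4)
```

With identical tokens, `logits[3, 1]` and `logits[5, 3]` agree because both pairs sit two apart. The ALiBi bias needs no table at all, which is why extending the sequence costs nothing beyond computing a larger distance matrix.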

RoPE: rotation as relative position

The encoding that modern LLMs converged on takes the relative-position idea and hides it inside a rotation. Rotary position embedding leaves the attention formula untouched and instead rotates each query and key vector by an angle proportional to its position, pairing up dimensions $(2i, 2i+1)$ and spinning each pair by $p\theta_i$ where the per-pair frequency is $\theta_i = 10000^{-2i/d}$.

$$\tilde{q}_p = R(p\theta)\, q, \qquad \tilde{k}_k = R(k\theta)\, k$$

The reason this is exactly relative encoding, despite looking absolute, is a property of rotation matrices: composing two rotations subtracts their angles inside a dot product, so the rotated query at position $p$ dotted with the rotated key at position $k$ depends only on the gap.

$$(R(p\theta)\, q) \cdot (R(k\theta)\, k) = q^\top R(p\theta)^\top R(k\theta)\, k = q^\top R(-p\theta)\, R(k\theta)\, k = q^\top R\big((k - p)\theta\big)\, k$$

The absolute positions $p$ and $k$ cancel and only $k - p$ survives, so RoPE delivers the relative-position behavior of Shaw or ALiBi without adding a single term to the logit. It is a transform applied to $q$ and $k$ before attention runs, and attention itself is none the wiser. The rotation is also norm-preserving, which matters: unlike an additive position embedding it never inflates or shrinks the query and key magnitudes, so it does not quietly distort the softmax temperature. That clean separation is why RoPE composes with grouped-query attention and the rest of the inference stack without special cases, and it is why LLaMA, Mistral, Qwen, Gemma, and DeepSeek all use it.
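
Both claims, relative-only scores and preserved norms, can be checked with a small sketch. The `rope` function below is an illustrative single-vector version that uses the interleaved $(2i, 2i+1)$ pairing described here; some real implementations pair dimension $i$ with $i + d/2$ instead, which gives the same relative behavior with a different memory layout.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate each (2i, 2i+1) pair of a 1-D vector x by position * theta_i."""
    d = x.shape[-1]                                   # assumed even
    theta = base ** (-2.0 * np.arange(d // 2) / d)    # theta_i = base^(-2i/d)
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin            # 2-D rotation, first coord
    out[1::2] = x_even * sin + x_odd * cos            # 2-D rotation, second coord
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same gap (3) at two very different absolute offsets.
score_near = rope(q, 5) @ rope(k, 8)
score_far = rope(q, 1005) @ rope(k, 1008)

# A different gap (4) for comparison.
score_other_gap = rope(q, 5) @ rope(k, 9)

# Rotation leaves the vector length alone.
norm_before = np.linalg.norm(q)
norm_after = np.linalg.norm(rope(q, 123))
```

`score_near` and `score_far` agree to floating-point precision, `score_other_gap` does not, and the two norms match.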

Stretching the window after training

The payoff that made RoPE indispensable is what it lets you do at inference. Because position enters only through a rotation angle, you can rescale the angle to pretend a longer sequence is shorter — position interpolation divides every position index by the ratio of target length to training length before rotating.

$$p' = p \cdot \frac{L_\text{train}}{L_\text{target}}$$

A model trained at 4096 positions and asked to run at 32k has its position indices compressed back into the $[0, 4096)$ range the rotations were tuned for, so the angles stay in distribution and the model degrades gracefully instead of falling off a cliff. Naive interpolation squeezes every frequency equally, which blurs the high-frequency dimensions that encode local order. The production recipes, NTK-aware scaling and YaRN, therefore interpolate the low frequencies while leaving the high ones nearly untouched, extending context with a few hundred steps of fine-tuning rather than a full retrain. This is how LLaMA-2's 4k window stretched past 32k without anyone training a 32k model from scratch.
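
The plain interpolation step is a one-liner, and writing it out makes the trade-off visible. This sketch covers only naive position interpolation with the 4096 and 32k lengths from the example above; `interpolate_positions` is a hypothetical helper, and NTK-aware scaling and YaRN are not implemented here.

```python
import numpy as np

def interpolate_positions(positions, train_len, target_len):
    """Position interpolation: p' = p * train_len / target_len."""
    return np.asarray(positions, dtype=np.float64) * (train_len / target_len)

train_len, target_len = 4096, 32768
positions = np.arange(target_len)
scaled = interpolate_positions(positions, train_len, target_len)

# Largest rescaled index stays inside the trained range [0, train_len).
largest = scaled.max()

# Neighbouring tokens are now a fraction of a position apart, so the
# fastest-rotating pair (theta_0 = 1) turns by that fraction of a radian
# per token instead of a full radian.
step = scaled[1] - scaled[0]
```

The rescaled indices would be fed to the rotation in place of the integer positions. The shrunken `step` is the flattening of local order that the frequency-aware recipes are designed to avoid.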

Positional encoding is the layer that decides what your model knows about order, and whether that knowledge survives past the longest sequence it ever saw. The next post moves inside the attention heads themselves — MHA, MQA, GQA — where the binding constraint turns out not to be order but memory bandwidth.
