RoPE (Rotary Position Embedding)

What RoPE is, how relative position emerges from the dot product, and how it extends to long context.

Introduction

A transformer has no memory of time and no concept of before or after. Every token arrives simultaneously, processed in one big parallel operation. "Dog bites man" and "man bites dog" are, to the raw attention mechanism, the same bag of words. The dot products do not know which token came first.

The original transformer paper computed a fixed sine-and-cosine position vector for each absolute position and added it to the token embedding before the first layer, so a single hidden state ended up carrying both what the word was and where it sat. The shortcomings of that arrangement, which took the field a few years to fully internalize, are interrelated: combining content and position into the same vector forces the model to disentangle them inside attention, and because the model only sees position vectors for positions 0 through N during training, anything past N is an out-of-distribution input to every layer at inference. Learned absolute embeddings, used in BERT and early GPT, inherit the same problem in a sharper form, because they replace the formula with a lookup table that simply does not have a defined value at position N+1.

The Mental Model

RoPE encodes position by rotating the query and key vectors before computing attention. Relative position emerges directly from the dot product geometry.

Why Position Matters

To understand why RoPE is necessary, you need to understand what attention does without it. Self-attention computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The attention matrix $QK^T$ is computed between all pairs of tokens simultaneously.

This operation is permutation equivariant. If you shuffle the input tokens with a permutation matrix $P$, the outputs are shuffled in exactly the same way. Formally:

$$X' = PX \quad \text{(input permuted)}$$
$$Q' = PQ, \quad K' = PK, \quad V' = PV$$
$$\text{Attention}(Q', K', V') = P \,\text{Attention}(Q, K, V)$$

Attention output: same information, just reordered.

The model cannot tell the difference between the original sequence and any permutation of it. "cat sat on mat" and "on mat cat sat" produce identical representations (just with tokens in different output slots). This is catastrophic for language, where order is everything.
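The equivariance claim can be checked numerically. The following is a minimal sketch in plain Python, assuming a single attention head with random projection matrices; the helper names (`matmul`, `self_attention`, `rand_matrix`) are illustrative and not from any library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head attention with no positional information at all."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # K transposed
    scores = matmul(q, k_t)                       # scores[i][j] = q_i . k_j
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

n_tokens, d = 4, 6
x = rand_matrix(n_tokens, d)
wq, wk, wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

perm = [2, 0, 3, 1]                               # new slot i holds old token perm[i]
out = self_attention(x, wq, wk, wv)
out_perm = self_attention([x[i] for i in perm], wq, wk, wv)

# Each output row moves with its token: out_perm[i] matches out[perm[i]].
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(max_diff)
```

The printed difference is at floating-point noise level: shuffling the inputs only shuffles the outputs, so nothing in the result records the original order.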

The Consequence

Without position encoding, "the dog bit the man" and "the man bit the dog" are indistinguishable to the model. Sentence meaning collapses entirely. Every language model needs some form of positional information injected before or during attention.

Three Approaches to Positional Encoding

Absolute: Add a position vector to token embeddings. Position 1 gets vector $p_1$, position 2 gets $p_2$, etc. Cannot generalize beyond training length.
Relative: Add a learned bias to attention scores based on distance between tokens. T5 uses this. Better generalization but adds compute overhead.
RoPE: Rotate Q and K vectors before computing attention. Relative position emerges from the geometry of the dot product. No extra parameters needed.

The Origin Story

The original Transformer (Vaswani et al., 2017) used sinusoidal positional encoding: fixed vectors computed from sine and cosine functions at different frequencies. Each position $m$ got a vector:

$$PE(m, 2i) = \sin\!\left(\frac{m}{10000^{2i/d}}\right)$$
$$PE(m, 2i+1) = \cos\!\left(\frac{m}{10000^{2i/d}}\right)$$

Added directly to the token embedding: $x' = x + PE(m)$
[Interactive figure: Sinusoidal Positional Encoding. Heatmap of dimension against position $m$. Each column is one position's d-dim encoding. Top rows oscillate fast, bottom rows barely move.]

This worked, but adding a position vector to the content vector entangles two signals that the model has to disentangle later inside attention. More critically, absolute position encodings, whether learned or fixed, generally extrapolate poorly beyond the training context length, because positions outside the trained range either have no embedding at all (lookup tables) or have a value the model has never been asked to reason about (fixed sinusoidal vectors at unseen positions).
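As a concrete reference for the sinusoidal formula, here is a minimal sketch that computes the encoding for a single position. The function name `sinusoidal_pe` and the toy dimension of 8 are assumptions made for illustration.

```python
import math

def sinusoidal_pe(m, d, base=10000.0):
    """Sinusoidal encoding for position m: sin on even dims, cos on odd dims."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = m / base ** (2 * i / d)
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe

pe0 = sinusoidal_pe(0, 8)   # position 0: every sin is 0, every cos is 1
pe8 = sinusoidal_pe(8, 8)

# The additive scheme simply sums this vector with the token embedding.
token = [0.5] * 8
x_prime = [t + p for t, p in zip(token, pe8)]
print(pe0)
print([round(v, 3) for v in pe8])
```

The early dimensions of `pe8` have already swung well away from their position-0 values, while the last pair has barely moved, which is the fast-to-slow pattern the figure describes.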

2017: Sinusoidal PE. Fixed absolute, added to embeddings. Fails beyond training length.

2019: Relative PE (T5). Learned bias added to attention logits. Better generalization, extra parameters.

2021: RoPE. Rotation-based, no extra parameters, relative position from geometry.

Why RoPE Became the Standard

RoPE causes relative position information to emerge naturally from the geometry of the rotated query-key dot product, without requiring learned relative-position parameters or a separate attention bias. It also enables better context length extension through scaling techniques, which turned out to be critical as the field pushed from 4K to 128K+ token contexts.

The Core Idea

RoPE's key idea is to encode position by rotating the Query and Key vectors before computing attention. Each consecutive pair of embedding dimensions is treated as a 2D plane, and the pair is rotated by an angle proportional to the token position:

$$R_m = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$

The 2D rotation matrix applied to each dimension pair. $m$ is position, $\theta$ is a frequency.

For a d-dimensional embedding, we have d/2 dimension pairs, each with its own frequency. The full rotation is block-diagonal: each pair rotates independently. The rotated query at position m is:

$$\tilde{q}_m = R_{\Theta,m}^d \cdot q_m$$

Where $R_{\Theta,m}^d$ is the full block-diagonal rotation matrix with d/2 blocks.

Position is encoded as rotation angle. The embedding direction rotates; magnitude stays constant.

[Interactive figure: Block-Diagonal Rotation. One d-dim embedding split into d/2 = 4 independent 2D rotations, each pair rotating at its own frequency $\theta_i$. The fast pair sweeps several full turns while the slow pair barely moves. The same position index drives every pair, but each one sees it through a different frequency.]
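To make the block-diagonal structure concrete, the sketch below builds the full $d \times d$ matrix explicitly and applies it to a toy vector. It assumes the standard RoPE frequency schedule $\theta_i = 10000^{-2i/d}$; the function names are illustrative. Building the matrix is for exposition only and is not how real implementations apply the rotation.

```python
import math

def block_diagonal_rotation(m, d, base=10000.0):
    """Build the d x d block-diagonal rotation for position m as a list of rows."""
    mat = [[0.0] * d for _ in range(d)]
    for i in range(d // 2):
        angle = m * base ** (-2 * i / d)      # pair i rotates by m * theta_i
        c, s = math.cos(angle), math.sin(angle)
        r = 2 * i
        mat[r][r], mat[r][r + 1] = c, -s
        mat[r + 1][r], mat[r + 1][r + 1] = s, c
    return mat

def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]

# Every pair starts at (1, 0), so after rotation pair i reads (cos, sin) of m * theta_i.
q = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
q_rot = matvec(block_diagonal_rotation(3, len(q)), q)
q_at_zero = matvec(block_diagonal_rotation(0, len(q)), q)
print([round(v, 4) for v in q_rot])
```

At position 0 every block is the identity, so `q_at_zero` equals `q`. At position 3 the first pair has turned by 3 radians while the last pair has turned by only about 0.003 radians.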

1. Position as Rotation

The original sinusoidal approach adds a position vector to the token embedding:

$$x' = x + p(m)$$

Absolute position: changes both direction AND magnitude of $x$

This has a fundamental problem: the content signal (x) and the position signal (p(m)) get added together and cannot be separated. The model has to disentangle them internally.

RoPE instead rotates the embedding by an angle proportional to position:

$$x' = R(m\theta)\,x$$

Rotation: changes direction, preserves magnitude

Why Rotation Is Better Than Addition

Addition $(x + p)$
  • Changes direction of $x$
  • Changes magnitude of $x$
  • Content and position entangled
  • Position signal washes out content

Rotation $(R(m\theta)\,x)$
  • Changes direction of $x$
  • Magnitude preserved: $\|Rx\| = \|x\|$
  • Content magnitude unchanged
  • Position is a clean directional label

The magnitude preservation comes from the orthogonality of rotation matrices. A rotation matrix $R$ satisfies $R^T R = I$. Therefore:

$$\|Rx\|^2 = (Rx)^T(Rx) = x^T R^T R\, x = x^T I x = \|x\|^2$$

Rotation preserves the 2-norm. Only the direction of x changes.
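The orthogonality argument is easy to confirm numerically. This is a small sketch for a single 2D pair, with helper names (`rotation_matrix`, `apply`, `gram`) and a 30° per-position step chosen purely for illustration.

```python
import math

def rotation_matrix(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]

def apply(mat, vec):
    return [mat[0][0] * vec[0] + mat[0][1] * vec[1],
            mat[1][0] * vec[0] + mat[1][1] * vec[1]]

def gram(mat):
    """Return R^T R, whose entries are dot products between columns of R."""
    cols = list(zip(*mat))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

r = rotation_matrix(3 * math.radians(30))   # position 3 at 30 degrees per step
x = [3.0, 4.0]
rotated = apply(r, x)

norm_before = math.hypot(*x)
norm_after = math.hypot(*rotated)
rt_r = gram(r)
print(rotated, norm_before, norm_after)
```

The vector (3, 4) lands at approximately (-4, 3): its direction changes by 90° while its length stays 5, and `rt_r` is the identity up to rounding.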

The Geometric Picture

Think of each token's embedding as a vector in $d$-dimensional space, split into 2D pairs. Within each pair's plane, rotation moves the point around a circle of constant radius. A token at position 1 is rotated by $\theta$. A token at position 2 is rotated by $2\theta$. The magnitude of the vector is untouched. Only where it points changes.

[Interactive figure: Position as Rotation. Adjusting the position rotates the embedding vector; the rotation matrix values change, but the vector length stays constant. Example shown: position $m = 3$ at 30° per step gives $\theta = 90°$, so $\cos\theta = 0.00$ and $\sin\theta = 1.00$.]

2. Complex Number Form

2D rotation has an elegant representation via complex numbers. Euler's formula says:

$$e^{i\theta} = \cos\theta + i\sin\theta$$

Multiplying a complex number by $e^{i\theta}$ rotates it by angle $\theta$.

RoPE treats each consecutive pair of embedding dimensions $(x_1, x_2)$ as a single complex number $z = x_1 + i x_2$. Rotation by angle $\theta$ becomes multiplication by $e^{i\theta}$:

The Complex Multiplication Expanded

$$z' = (x_1 + i x_2)(\cos\theta + i\sin\theta)$$
$$\phantom{z'} = (x_1\cos\theta - x_2\sin\theta) + i(x_1\sin\theta + x_2\cos\theta)$$

Real part: $x_1\cos\theta - x_2\sin\theta$ (= rotated $x_1$)
Imag part: $x_1\sin\theta + x_2\cos\theta$ (= rotated $x_2$)

This is exactly the 2D rotation matrix. Complex multiplication IS the rotation.

For dimension pair $i$ at position $m$, the rotation angle is $m\theta_i$, giving:

$$z_i' = z_i \cdot e^{i m \theta_i}$$

$m$ = token position, $\theta_i$ = frequency for dimension pair $i$

Why This Matters for Implementation

Treating each pair as a complex number makes the cost clear: rotating a vector takes a handful of elementwise operations, not a $d \times d$ matrix multiplication. In practice, implementations precompute tables of $(\cos(m\theta_i), \sin(m\theta_i))$ for all positions and all dimension pairs, then apply them with elementwise multiplies and a rearrangement of the paired dimensions. No explicit rotation matrices are ever instantiated.
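The two routes, complex multiplication and precomputed cos/sin tables, can be compared directly. The sketch below is a plain-Python illustration under the assumption of interleaved pairs $(x_{2i}, x_{2i+1})$ and the schedule $\theta_i = 10000^{-2i/d}$; the function names are hypothetical, and production code would vectorize this over batches and heads.

```python
import cmath
import math

def rope_complex(x, m, base=10000.0):
    """Rotate each pair by multiplying it, as a complex number, by e^{i m theta_i}."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2 * i / d)
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * m * theta_i)
        out += [z.real, z.imag]
    return out

def build_tables(max_pos, d, base=10000.0):
    """Precompute cos(m * theta_i) and sin(m * theta_i) for all positions and pairs."""
    thetas = [base ** (-2 * i / d) for i in range(d // 2)]
    cos = [[math.cos(m * t) for t in thetas] for m in range(max_pos)]
    sin = [[math.sin(m * t) for t in thetas] for m in range(max_pos)]
    return cos, sin

def rope_tables(x, m, cos, sin):
    """Same rotation using table lookups and elementwise arithmetic only."""
    out = []
    for i in range(len(x) // 2):
        x1, x2 = x[2 * i], x[2 * i + 1]
        out += [x1 * cos[m][i] - x2 * sin[m][i],
                x1 * sin[m][i] + x2 * cos[m][i]]
    return out

x = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
cos_table, sin_table = build_tables(max_pos=16, d=len(x))
via_complex = rope_complex(x, 5)
via_tables = rope_tables(x, 5, cos_table, sin_table)
table_gap = max(abs(a - b) for a, b in zip(via_complex, via_tables))
print(table_gap)
```

Both routes agree to rounding error. The table version pays the trigonometry cost once per (position, pair) and reuses it for every token, head, and layer.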

3. Dot Product Geometry

This is the central result that makes RoPE work. Attention computes $q \cdot k$. After rotating $q$ at position $m$ and $k$ at position $n$, the attention score is:

Derivation: Relative Position from Dot Product

$$q_m = R(m)\,q \quad \text{(query at position } m \text{, rotated)}$$
$$k_n = R(n)\,k \quad \text{(key at position } n \text{, rotated)}$$
$$q_m \cdot k_n = (R(m)\,q)^T \, (R(n)\,k)$$
$$= q^T R(m)^T R(n)\, k$$
$$= q^T R(-m)\, R(n)\, k \quad \text{(since } R(\theta)^T = R(\theta)^{-1} = R(-\theta) \text{)}$$
$$= q^T R(n-m)\, k$$

Result depends only on $(n - m)$, the relative position.

The key step is that rotation matrices compose: $R(-m)\,R(n) = R(n-m)$. Rotating by $-m$ then by $n$ is the same as rotating by $(n-m)$ total. The positional contribution to the score depends only on the relative offset $n-m$, while the overall attention score still depends on the content vectors $q$ and $k$.

In the 2D case the same identity expands into an explicit content / position split:

$$\langle R_m q,\, R_n k \rangle \;=\; \langle q, k \rangle \cos((m - n)\theta) \;+\; \langle q^{\perp},\, k \rangle \sin((m - n)\theta)$$

where $q^{\perp}$ is $q$ rotated by 90°. The score splits cleanly into content terms (the dot products) and a position term (cos and sin of the relative offset).

$$q_m \cdot k_n = f(q, k, \, n - m)$$

The attention score depends on the query content $q$, key content $k$, and their relative distance $(n-m)$. Not on $m$ or $n$ individually.
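The relative-position property can be checked with a few lines of code. This sketch assumes interleaved pairs and $\theta_i = 10000^{-2i/d}$; the vectors and positions are arbitrary toy values, and `rope` and `dot` are illustrative helper names.

```python
import cmath

def rope(x, m, base=10000.0):
    """Rotate each (x[2i], x[2i+1]) pair by the angle m * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2 * i / d)
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * m * theta_i)
        out += [z.real, z.imag]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.4, -0.6, 0.9, 0.3, -0.2, -1.5, 0.7]

score_near = dot(rope(q, 3), rope(k, 1))       # query at 3, key at 1
score_far = dot(rope(q, 103), rope(k, 101))    # both shifted by 100, same offset
score_other = dot(rope(q, 3), rope(k, 2))      # different offset
print(score_near, score_far, score_other)
```

The first two scores match to rounding error even though the absolute positions differ by 100, while changing the offset changes the score.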

Why This Is Remarkable

No special "relative position bias" is needed. No extra parameters are learned. The relative position structure emerges automatically from applying rotation to Q and K before the dot product. The model learns Q and K projection matrices, and the geometry of rotation does the rest.

[Interactive figure: Relative Position from Dot Product. Adjusting the query and key positions shows that the dot product depends only on their difference, not their absolute values. Example shown: Q at position 3 and K at position 1 give Δpos = +2; at 22.5° per step the angle is 45°, and $Q \cdot K \propto \cos(\Delta\theta) \approx 0.707$.]

4. Frequency Bands

A single rotation frequency $\theta$ is not enough. Different aspects of meaning operate at different scales: local syntax (adjacent tokens) and global semantics (tokens far apart) are both important. RoPE uses a different rotation frequency for each dimension pair:

$$\theta_i = 10000^{-2i/d}$$

For i = 0, 1, ..., d/2 - 1. Frequencies decay geometrically with dimension index.

The period of each dimension pair (how many positions to complete one full rotation) is:

$$\text{period}_i = \frac{2\pi}{\theta_i} = 2\pi \times 10000^{2i/d}$$

Concrete Periods ($d_{\text{head}} = 128$)

| Dim pair $i$ | $\theta_i$ | Period (positions) | Captures |
|---|---|---|---|
| $i = 0$ | 1.000 | ~6.3 | Immediate neighbors |
| $i = 10$ | 0.237 | ~26 | Phrase-level |
| $i = 32$ | 0.010 | ~628 | Paragraph-level |
| $i = 63$ | 0.000115 | ~54,400 | Document-level |

The fastest pair completes a full rotation in about 6 positions. The slowest takes about 54,000. Over a 4096-token context, the slowest dim pair rotates through only about 27° of its full cycle.
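The frequencies and periods follow directly from the formula $\theta_i = 10000^{-2i/d}$, so they are easy to recompute. A minimal sketch, assuming $d_{\text{head}} = 128$ and a hypothetical helper `rope_period`:

```python
import math

def rope_period(i, d, base=10000.0):
    """Return (theta_i, period in positions) for dimension pair i."""
    theta_i = base ** (-2 * i / d)
    return theta_i, 2 * math.pi / theta_i

for i in (0, 10, 32, 63):
    theta_i, period = rope_period(i, 128)
    print(f"i={i:2d}  theta={theta_i:.6f}  period={period:,.1f}")

# How far the slowest pair turns across a 4096-token context, in degrees.
slowest_theta, slowest_period = rope_period(63, 128)
degrees_at_4096 = math.degrees(4096 * slowest_theta)
print(f"slowest pair after 4096 positions: {degrees_at_4096:.1f} degrees")
```

Changing `d` or `base` in the call shows how the whole spectrum shifts, which is the lever that context-extension methods pull.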

High Frequency ($i$ near 0)

Rotates fast. Completes cycles in just a few positions. Captures short-range structure: neighboring tokens, bigrams, local syntax.

Low Frequency ($i$ near $d/2$)

Rotates slowly. Period spans thousands of positions. Captures long-range dependencies: document structure, coreference, global topic.

The Fourier Analogy

This is the same idea as a Fourier decomposition: high-frequency components carry rapid, local variation, low-frequency components carry slow, global structure. RoPE applies that multi-scale spectral structure to position, with the frequencies fixed by formula rather than learned from data.

[Interactive figure: Frequency Bands. Four dimension pairs, from high frequency (f = 0.25) to low frequency (f = 0.001), rotating at different speeds as position advances. High frequencies capture short-range position differences; low frequencies capture long-range structure.]

5. Long Context Extrapolation

RoPE encodes relative position in the attention dot product. In principle, "token A is 5 positions before token B" should mean the same thing whether A is at position 10 or position 10,000. So why does a model trained on 4,096-token sequences degrade when given 8,192 tokens?

The Aliasing Problem

The high-frequency dimension pairs complete full $2\pi$ rotations every ~6 positions. At relative distance 7, the rotation angle wraps around and, for that pair alone, becomes indistinguishable from relative distance ~0.7. This is fine during training, where the model sees all those distances and the slower pairs disambiguate them. But at relative distances longer than any seen in training, the model starts encountering combinations of rotation angles it was never trained on.

More precisely: the model learns attention patterns that map certain (rotation angle differences) to certain (how much to attend). Beyond training length, low-frequency dimension pairs have rotated to angles they have never encountered. The model has no learned behavior for those configurations.
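One rough way to see which pairs are affected is to list those whose period exceeds the training length, since they never complete a full turn during training. This is a simplified diagnostic sketch, not a model of actual degradation; the function name, $d_{\text{head}} = 128$, and a 4096-token training length are assumptions.

```python
import math

def pairs_with_unseen_angles(d, train_len, base=10000.0):
    """Indices of pairs whose period is longer than the training context."""
    return [
        i for i in range(d // 2)
        if 2 * math.pi / base ** (-2 * i / d) > train_len
    ]

unseen = pairs_with_unseen_angles(128, 4096)
print(f"{len(unseen)} of 64 pairs, starting at i={unseen[0]}")

# Largest angle the slowest pair reaches in training vs. at twice that length.
slowest_theta = 10000.0 ** (-2 * 63 / 128)
angle_train = math.degrees(4096 * slowest_theta)
angle_double = math.degrees(8192 * slowest_theta)
print(f"{angle_train:.1f} degrees seen, {angle_double:.1f} degrees needed at 8192")
```

Under these assumptions, the slowest pairs only ever sweep part of the circle during training, so longer offsets push them into angles with no learned behavior.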

Absolute Embeddings

Position 8192 was never seen. The embedding is literally undefined or random. Quality collapses immediately.

Raw RoPE

Degrades more gracefully (relative position is still encoded), but performance typically degrades beyond the training context length, with the onset and severity varying considerably across models.

The Core Advantage Over Absolute PE

RoPE still extrapolates much better than absolute embeddings because the relative distance structure is preserved. Many RoPE-based models retain useful performance somewhat beyond their training context length, though quality generally degrades with increasing extrapolation. Extensions like NTK-aware scaling and YaRN push usable context to 32K-128K with limited quality loss, usually after a short fine-tune.

[Interactive figure: Long Context Extrapolation. Compares training and inference lengths (example: train 2K, test 4K, a 2.0x multiplier). Absolute PE: embeddings beyond the training length are unseen and quality degrades rapidly. RoPE: the same rotation formula applies at any position, the rotation continues smoothly across the training boundary, and relative distances stay meaningful.]

RoPE Extensions

Three main techniques extend RoPE to longer contexts than the model was trained on. Each makes a different trade-off between local and global position resolution.

Position Interpolation (PI)

Proposed by Chen et al. (Meta, 2023). Instead of extrapolating to unseen positions, interpolate between seen ones. If trained on $L = 4096$ tokens and you want $L' = 32768$ tokens, scale every position index by $L/L'$:

$$m' = m \cdot \frac{L}{L'} \quad \text{(e.g., position 8192 maps to 1024)}$$

All rotation angles stay within the training range. The cost is that high-frequency dimension pairs now have their angles compressed: nearby tokens that previously had distinct high-frequency signals now look more similar to each other. Local structure gets blurred.

Requires ~1000 steps of fine-tuning on longer sequences to recover quality.
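The position rescaling is a one-line change to the angle computation. A minimal sketch, assuming $\theta_i = 10000^{-2i/d}$ and the 4096 to 32768 example; `interpolate_position` and `pi_angles` are illustrative names.

```python
def interpolate_position(m, train_len, target_len):
    """Map a position in the extended context back into the trained range."""
    return m * train_len / target_len

def pi_angles(m, d, train_len, target_len, base=10000.0):
    """Rotation angles for position m under Position Interpolation."""
    m_scaled = interpolate_position(m, train_len, target_len)
    return [m_scaled * base ** (-2 * i / d) for i in range(d // 2)]

mapped = interpolate_position(8192, 4096, 32768)
last = interpolate_position(32767, 4096, 32768)

# Angle gap between adjacent tokens in the fastest pair: 1 radian originally.
neighbor_gap = (pi_angles(1, 128, 4096, 32768)[0]
                - pi_angles(0, 128, 4096, 32768)[0])
print(mapped, last, neighbor_gap)
```

The last position of the extended context maps just inside the trained range, but adjacent tokens are now separated by only 1/8 of a radian in the fastest pair, which is the local blurring described above.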

NTK-Aware Scaling

Discovered by u/bloc97 (2023), motivated by Neural Tangent Kernel theory. The key insight: instead of scaling positions, change the RoPE base from 10000 to a larger value:

$$b' = 10000 \cdot \alpha^{d/(d-2)} \quad \text{where } \alpha = L'/L$$

A larger base lengthens the period of every frequency band except the first ($\theta_0 = 1$ for any base). Crucially, high-frequency dims change least (their period was already short and stays short). Low-frequency dims change most (their very long periods get even longer, covering the extended context). This is the right direction: we want long-range dims to cover longer ranges without damaging short-range dims.

Works for moderate extensions (2-4x) without any fine-tuning.
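The effect of the base change on each band can be computed directly. The sketch below assumes $d = 128$ and an 8x extension; `ntk_base` and `periods` are illustrative names.

```python
import math

def ntk_base(alpha, d, base=10000.0):
    """NTK-aware base: b' = base * alpha^(d / (d - 2))."""
    return base * alpha ** (d / (d - 2))

def periods(d, base):
    """Period of each dimension pair, in positions."""
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

d, alpha = 128, 8.0
old = periods(d, 10000.0)
new = periods(d, ntk_base(alpha, d))

# Per-pair stretch factor: how much longer each period becomes.
stretch = [n / o for n, o in zip(new, old)]
print(f"new base: {ntk_base(alpha, d):,.0f}")
print(f"stretch: fastest {stretch[0]:.3f}, middle {stretch[32]:.3f}, slowest {stretch[-1]:.3f}")
```

The exponent $d/(d-2)$ is chosen so that the slowest pair is stretched by exactly $\alpha$, matching what linear interpolation would do to it, while the fastest pair is left untouched and the bands in between are stretched progressively.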

YaRN (Yet Another RoPE extensioN)

Peng et al. (2023). Combines NTK scaling with frequency-band-aware interpolation and an attention temperature adjustment. YaRN divides dimension pairs into three groups based on how each pair's wavelength compares with the original training length $L$:

Short period: wavelength much shorter than $L$ (many full rotations within the training context): keep frequency unchanged. These dims are fine.
Mid period: wavelength in between: blend smoothly between the unchanged and the interpolated frequency.
Long period: wavelength $\geq L$ (less than one full rotation within the training context): apply linear interpolation (PI-style), dividing the frequency by $s = L'/L$.

YaRN also sharpens the attention distribution to counteract the increased attention entropy at longer contexts: the logits are divided by a temperature $t$ with $\sqrt{1/t} = 0.1 \ln(s) + 1$, which in practice means scaling both $q$ and $k$ by $0.1 \ln(s) + 1$. Achieves 64K-token context from a 4K-trained model with only ~400 fine-tuning steps.
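The per-pair blending used by YaRN can be sketched as follows. This follows the "NTK-by-parts" formulation in Peng et al. as a hedged illustration: pairs are classified by how many rotations they complete within the original training length, with thresholds `alpha = 1` and `beta = 32` (the values the paper suggests for LLaMA-family models, treated here as assumptions). The function names are hypothetical.

```python
import math

def yarn_thetas(d, train_len, scale, base=10000.0, alpha=1.0, beta=32.0):
    """Per-pair frequencies after YaRN-style blending (illustrative sketch)."""
    thetas = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        wavelength = 2 * math.pi / theta
        rotations = train_len / wavelength     # full turns within training length
        # gamma = 1 keeps theta unchanged; gamma = 0 applies full interpolation.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        thetas.append((1 - gamma) * theta / scale + gamma * theta)
    return thetas

def yarn_qk_scale(scale):
    """Factor applied to both q and k; logits therefore scale by its square."""
    return 0.1 * math.log(scale) + 1.0

thetas = yarn_thetas(128, 4096, 16.0)   # extend 4K training length by 16x
qk_scale = yarn_qk_scale(16.0)
print(thetas[0], thetas[30], thetas[63], qk_scale)
```

With these settings the fastest pair keeps its original frequency, the slowest is divided by the full scale factor of 16, and pairs in between land somewhere between the two.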

Used by Nous Research's Yarn-Mistral 64K/128K, recommended as a scaling option in Qwen long-context releases, and adopted in many community Llama-based 32K-128K fine-tunes.

Extension Performance Comparison

| Method | 2K to 8K | 2K to 32K | Fine-tuning? |
|---|---|---|---|
| Raw RoPE | Poor | Fails | N/A |
| PI | OK | Degrades | Required |
| NTK | Good | Moderate | Optional |
| YaRN | Excellent | Good | Optional (fast) |

Models Using RoPE

RoPE has become the de facto standard for modern autoregressive language models. Almost every significant open-weight model released since 2023 uses it.

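Uses RoPE

  • LLaMA 1, 2, 3 (Meta)
  • Mistral 7B, Mixtral 8x7B, 8x22B
  • Gemma / Gemma 2 (Google)
  • Qwen 1.5, 2, 2.5 (Alibaba)
  • DeepSeek V2, V3 (RoPE applied to a dedicated subset of dimensions)
  • GPT-NeoX / GPT-J (EleutherAI; rotary applied to only part of each head's dimensions)
  • OLMo (Allen AI)
  • Phi-3 (Microsoft)
  • CodeLlama (with an enlarged RoPE base, in the spirit of NTK scaling)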

Uses Other Approaches

  • GPT-3 (OpenAI): Learned positional embeddings. GPT-4's positional encoding mechanism has not been publicly disclosed.
  • T5 / Flan-T5: Relative position bias (T5-style)
  • BERT: Learned absolute embeddings (encoder only)
  • ALiBi models: Attention with Linear Biases (MPT, BLOOM variants)

Why the Entire Field Converged on RoPE

Three properties sealed RoPE's dominance. First, it is parameter-free: no extra weights are needed to represent position, unlike learned absolute embeddings. Second, relative position information emerges naturally from the geometry of the rotated query-key dot product, without any extra learned bias term. Third, it is extensible: NTK scaling and YaRN can push context length far beyond training without full retraining, which turned out to be essential as the field moved from 4K to 128K token contexts.