Introduction

A transformer has no memory of time and no concept of before or after. Every token arrives simultaneously, processed in one big parallel operation. "Dog bites man" and "man bites dog" are, to the raw attention mechanism, the same bag of words. The dot products do not know which token came first.

The original transformer paper computed a fixed sine-and-cosine position vector for each absolute position and added it to the token embedding before the first layer, so a single hidden state ended up carrying both what the word was and where it sat. The shortcomings of that arrangement, which took the field a few years to fully internalize, are interrelated: combining content and position into the same vector forces the model to disentangle them inside attention, and because the model only sees position vectors for positions 0 through N during training, anything past N is an out-of-distribution input to every layer at inference. Learned absolute embeddings, used in BERT and early GPT, inherit the same problem in a sharper form, because they replace the formula with a lookup table that simply does not have a defined value at position N+1.

The Mental Model

RoPE encodes position by rotating token embeddings before computing attention. Relative position emerges directly from the dot product geometry.

Why Position Matters

To understand why RoPE is necessary, you need to understand what attention does without it. Self-attention computes:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V

The attention matrix $QK^T$ is computed between ALL pairs of tokens simultaneously.

This operation is permutation equivariant. If you shuffle the input tokens with permutation matrix P, the outputs are shuffled in exactly the same way. Formally:

X' = PX

(input permuted)

Q' = PQ, \quad K' = PK, \quad V' = PV

\text{Attention}(Q', K', V') = P \, \text{Attention}(Q, K, V)

(attention output: same information, just reordered)

The model cannot tell the difference between the original sequence and any permutation of it. "cat sat on mat" and "on mat cat sat" produce identical representations (just with tokens in different output slots). This is catastrophic for language, where order is everything.
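The equivariance claim can be checked numerically. The following is a minimal sketch, assuming identity projections (Q = K = V = X) and arbitrary toy embeddings; `self_attention` is a hypothetical helper written for this illustration, not a library function.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Toy "embeddings" for three tokens (values are arbitrary).
X = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
perm = [2, 0, 1]                      # new slot i holds old token perm[i]
X_perm = [X[p] for p in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)

# Equivariance: permuting the input permutes the output rows identically.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(Y_perm[i], Y[p])
)
print(equivariant)
```

Learned projection matrices would not change the result, because they act on each token row independently.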

The Consequence

Without position encoding, "the dog bit the man" and "the man bit the dog" are indistinguishable to the model. Sentence meaning collapses entirely. Every language model needs some form of positional information injected before or during attention.

Three Approaches to Positional Encoding

Absolute: Add a position vector to token embeddings. Position 1 gets vector $p_1$, position 2 gets $p_2$, etc. Cannot generalize beyond training length.

Relative: Add a learned bias to attention scores based on distance between tokens. T5 uses this. Better generalization but adds compute overhead.

RoPE: Rotate Q and K vectors before computing attention. Relative position emerges from the geometry of the dot product. No extra parameters needed.

The Origin Story

The original Transformer (Vaswani et al., 2017) used sinusoidal positional encoding: fixed vectors computed from sine and cosine functions at different frequencies. Each position $m$ got a vector:

PE(m, 2i) = \sin\!\left(\frac{m}{10000^{2i/d}}\right)

PE(m, 2i+1) = \cos\!\left(\frac{m}{10000^{2i/d}}\right)

Added directly to the token embedding:

x' = x + PE(m)
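As a rough illustration of the two formulas and the addition step, here is a small sketch. The function names and the toy embedding are assumptions made for this example.

```python
import math

def sinusoidal_pe(m, d, base=10000.0):
    """Return the d-dimensional sinusoidal encoding for position m (d even)."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = m / base ** (2 * i / d)
        pe[2 * i] = math.sin(angle)       # PE(m, 2i)
        pe[2 * i + 1] = math.cos(angle)   # PE(m, 2i+1)
    return pe

def add_position(x, m):
    """x' = x + PE(m): content and position end up in one vector."""
    return [xi + pi for xi, pi in zip(x, sinusoidal_pe(m, len(x)))]

x = [0.3, -1.2, 0.8, 0.1]          # toy token embedding, d = 4
print(sinusoidal_pe(0, 4))          # [0.0, 1.0, 0.0, 1.0]
print(add_position(x, 8))
```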

[Figure: Sinusoidal Positional Encoding. Each column is one position's $d$-dimensional encoding. Top rows oscillate fast, bottom rows barely move.]

This worked, but adding a position vector to the content vector entangles two signals that the model has to disentangle later inside attention. More critically, absolute position encodings generally extrapolate poorly beyond the training context length, because positions outside the trained range either have no embedding at all (learned lookup tables) or have a value the model has never been asked to reason about (fixed sinusoidal vectors at unseen positions).

2017: Sinusoidal PE. Fixed absolute, added to embeddings. Fails beyond training length.

2019: Relative PE (T5). Learned bias added to attention logits. Better generalization, extra parameters.

2021: RoPE. Rotation-based, no extra parameters, relative position from geometry.

Why RoPE Became the Standard

RoPE causes relative position information to emerge naturally from the geometry of the rotated query-key dot product, without requiring learned relative-position parameters or a separate attention bias. It also enables better context length extension through scaling techniques, which turned out to be critical as the field pushed from 4K to 128K+ token contexts.

The Core Idea

RoPE's key idea is to encode position by rotating the Query and Key vectors before computing attention. Each pair of embedding dimensions is treated as a 2D plane, and the pair is rotated by an angle proportional to the token position:

R_m = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}

The 2D rotation matrix applied to each dimension pair. $m$ is position, $\theta$ is a frequency.

For a d-dimensional embedding, we have d/2 dimension pairs, each with its own frequency. The full rotation is block-diagonal: each pair rotates independently. The rotated query at position m is:

\tilde{q}_m = R_{\Theta,m}^d \cdot q_m

Where $R_{\Theta,m}^d$ is the full block-diagonal rotation matrix with d/2 blocks.

Position is encoded as rotation angle. The embedding direction rotates; magnitude stays constant.
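The block-diagonal rotation can be sketched without ever building the matrix, by rotating each pair in a loop. This example assumes consecutive pairing (dimensions 2i and 2i+1) and the frequency schedule $\theta_i = 10000^{-2i/d}$ that the Frequency Bands section introduces; some real implementations pair dimensions differently. `apply_rope` and the toy query are illustrative names and values.

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def apply_rope(x, m, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by angle m * theta_i."""
    out = list(x)
    for i, theta in enumerate(rope_frequencies(len(x), base)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def norm(v):
    return math.sqrt(sum(t * t for t in v))

q = [1.0, 0.0, 0.5, -0.5, 2.0, 1.0, 0.0, 3.0]   # toy query, d = 8
q_rot = apply_rope(q, m=3)
print(q_rot)
print(norm(q), norm(q_rot))   # equal up to floating-point rounding
```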

[Figure: Block-Diagonal Rotation. The fast pair (left) sweeps several full turns while the slow pair (right) barely moves. The same position index drives every pair, but each one sees it through a different frequency.]

1. Position as Rotation

The original sinusoidal approach adds a position vector to the token embedding:

x' = x + p(m)

Absolute position: changes both direction AND magnitude of $x$

This has a fundamental problem: the content signal ($x$) and the position signal ($p(m)$) are summed into a single vector, and nothing in that vector marks which part is which. The model has to learn to disentangle them internally.

RoPE instead rotates the embedding by an angle proportional to position:

x' = R(m\theta) \, x

Rotation: changes direction, preserves magnitude

Why Rotation Is Better Than Addition

Addition ($x + p$):

Changes direction of $x$
Changes magnitude of $x$
Content and position entangled
Position signal washes out content

Rotation ($R(m\theta)\,x$):

Changes direction of $x$
Magnitude preserved: $\|Rx\| = \|x\|$
Content magnitude unchanged
Position is a clean directional label

The magnitude preservation comes from the orthogonality of rotation matrices. A rotation matrix $R$ satisfies $R^T R = I$ . Therefore:

\|Rx\|^2 = (Rx)^T(Rx) = x^T R^T R x = x^T I x = \|x\|^2

Rotation preserves the 2-norm. Only the direction of x changes.

The Geometric Picture

Think of each token's embedding as a vector in $d$-dimensional space, split into $d/2$ two-dimensional pairs. Within each pair, rotation moves the 2D vector around a circle of constant radius. A token at position 1 is rotated by angle $\theta$; a token at position 2 is rotated by $2\theta$. The length of the vector is untouched. Only where it points changes.

Interactive: Rotation

[Interactive figure: Position as Rotation. As the position changes, the embedding vector rotates and the rotation matrix entries change, but the vector length stays constant.]

2. Complex Number Form

2D rotation has an elegant representation via complex numbers. Euler's formula says:

e^{i\theta} = \cos\theta + i\sin\theta

Multiplying a complex number by $e^{i\theta}$ rotates it by angle $\theta$ .

RoPE treats each consecutive pair of embedding dimensions $(x_1, x_2)$ as a single complex number $z = x_1 + i x_2$ . Rotation by angle $\theta$ becomes multiplication by $e^{i\theta}$ :

The Complex Multiplication Expanded

z' = (x_1 + i x_2)(\cos\theta + i\sin\theta)

\phantom{z'} = (x_1\cos\theta - x_2\sin\theta) + i(x_1\sin\theta + x_2\cos\theta)

Real part: $x_1\cos\theta - x_2\sin\theta$ (= rotated $x_1$)

Imag part: $x_1\sin\theta + x_2\cos\theta$ (= rotated $x_2$)

This is exactly the 2D rotation matrix. Complex multiplication IS the rotation.

For dimension pair $i$ at position $m$ , the rotation angle is $m\theta_i$ , giving:

z_i' = z_i \cdot e^{i m \theta_i}

$m$ = token position, $\theta_i$ = frequency for dimension pair $i$
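The equivalence between the matrix form and the complex form is easy to confirm with Python's built-in complex numbers. This is only a sketch: the position, frequency and pair values below are arbitrary choices for the example.

```python
import cmath
import math

def rotate_matrix(x1, x2, angle):
    """Explicit 2x2 rotation applied to the pair (x1, x2)."""
    c, s = math.cos(angle), math.sin(angle)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

def rotate_complex(x1, x2, angle):
    """The same rotation, written as multiplication by e^{i * angle}."""
    z = complex(x1, x2) * cmath.exp(1j * angle)
    return (z.real, z.imag)

m, theta_i = 5, 0.25            # hypothetical position and frequency
a = rotate_matrix(0.6, -1.1, m * theta_i)
b = rotate_complex(0.6, -1.1, m * theta_i)
print(a)
print(b)                         # matches `a` up to rounding
```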

Why This Matters for Implementation

Treating each pair as a complex number turns the rotation into an elementwise operation, which is far cheaper than multiplying by an explicit $d \times d$ block-diagonal matrix that is mostly zeros. In practice, implementations precompute tables of $(\cos(m\theta_i), \sin(m\theta_i))$ for all positions and all dimension pairs, then apply them with elementwise multiplies and a rearrangement of the pair components. No explicit rotation matrices are ever instantiated.
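A table-based version might look like the sketch below. It assumes consecutive pairing and the base-10000 frequency schedule; `build_rope_tables` and `apply_rope_cached` are hypothetical names, and production code would do the same thing with tensor operations rather than Python lists.

```python
import math

def build_rope_tables(max_pos, d, base=10000.0):
    """cos_table[m][i] = cos(m * theta_i), sin_table[m][i] = sin(m * theta_i)."""
    thetas = [base ** (-2 * i / d) for i in range(d // 2)]
    cos_table = [[math.cos(m * t) for t in thetas] for m in range(max_pos)]
    sin_table = [[math.sin(m * t) for t in thetas] for m in range(max_pos)]
    return cos_table, sin_table

def apply_rope_cached(x, m, cos_table, sin_table):
    """Rotate x at position m using table lookups only (no trig calls)."""
    even, odd = x[0::2], x[1::2]                 # first / second element of each pair
    cos_m, sin_m = cos_table[m], sin_table[m]
    rot_even = [a * c - b * s for a, b, c, s in zip(even, odd, cos_m, sin_m)]
    rot_odd = [a * s + b * c for a, b, c, s in zip(even, odd, cos_m, sin_m)]
    out = [0.0] * len(x)
    out[0::2], out[1::2] = rot_even, rot_odd     # re-interleave the pairs
    return out

cos_table, sin_table = build_rope_tables(max_pos=16, d=4)
k = [1.0, 2.0, 3.0, 4.0]
print(apply_rope_cached(k, 0, cos_table, sin_table))   # position 0 leaves k unchanged
print(apply_rope_cached(k, 7, cos_table, sin_table))
```

The tables depend only on the maximum position and the head dimension, so they can be built once and shared across layers and heads.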

3. Dot Product Geometry

This is the central result that makes RoPE work. Attention computes $q \cdot k$ . After rotating $q$ at position $m$ and $k$ at position $n$ , the attention score is:

Derivation: Relative Position from Dot Product

q_m = R(m)\,q

(query at position $m$, rotated)

k_n = R(n)\,k

(key at position $n$, rotated)

q_m \cdot k_n = (R(m)q)^T \, (R(n)k)

= q^T R(m)^T R(n)\, k

= q^T R(-m) R(n)\, k

(since $R(\theta)^T = R(\theta)^{-1} = R(-\theta)$)

= q^T R(n-m)\, k

Result depends only on $(n - m)$, the relative position.

The key step is that rotation matrices compose: $R(-m)\,R(n) = R(n-m)$ . Rotating by $-m$ then by $n$ is the same as rotating by $(n-m)$ total. The positional contribution to the score depends only on the relative offset $n-m$ , while the overall attention score still depends on the content vectors $q$ and $k$ .

In the 2D case the same identity expands into an explicit content / position split:

\langle R_m q,\, R_n k \rangle \;=\; \langle q, k \rangle \cos((m - n)\theta) \;+\; \langle q^{\perp},\, k \rangle \sin((m - n)\theta)

where $q^{\perp}$ is $q$ rotated by 90° counterclockwise. The score splits cleanly into content terms (the two dot products) and position terms (the cosine and sine of the relative offset).

q_m \cdot k_n = f(q, k, \, n - m)

The attention score depends on the query content $q$ , key content $k$ , and their relative distance $(n-m)$ . Not on $m$ or $n$ individually.
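A quick numerical check of this property in the 2D case: shifting both positions by the same amount should leave the score unchanged, while changing the offset should not. The vectors and the frequency `theta=0.3` are arbitrary values chosen for this sketch.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (v[0] * c - v[1] * s, v[0] * s + v[1] * c)

def rope_score(q, k, m, n, theta=0.3):
    """Dot product of q rotated to position m with k rotated to position n."""
    qm, kn = rotate(q, m * theta), rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (0.7, -0.2), (1.5, 0.4)               # toy 2D query and key
s_near = rope_score(q, k, m=3, n=1)          # offset n - m = -2
s_far = rope_score(q, k, m=1003, n=1001)     # same offset, shifted by 1000
s_other = rope_score(q, k, m=3, n=2)         # different offset

print(s_near, s_far)    # agree up to rounding
print(s_other)          # differs
```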

Why This Is Remarkable

No special "relative position bias" is needed. No extra parameters are learned. The relative position structure emerges automatically from applying rotation to Q and K before the dot product. The model learns Q and K projection matrices, and the geometry of rotation does the rest.

Interactive: Relative Position

[Interactive figure: Relative Position from Dot Product. As the query and key positions change, the dot product depends only on their difference, not on their absolute values.]

4. Frequency Bands

A single rotation frequency $\theta$ is not enough. Different aspects of meaning operate at different scales: local syntax (adjacent tokens) and global semantics (tokens far apart) are both important. RoPE uses a different rotation frequency for each dimension pair:

\theta_i = 10000^{-2i/d}

For i = 0, 1, ..., d/2 - 1. Frequencies decay geometrically with dimension index.

The period of each dimension pair (how many positions to complete one full rotation) is:

\text{period}_i = \frac{2\pi}{\theta_i} = 2\pi \times 10000^{2i/d}

Concrete Periods ( $d_{\text{head}} = 128$ )

Dim pair i	$\theta_i$	Period (positions)	Captures
i = 0	1.000	~6.3	Immediate neighbors
i = 10	0.237	~26.5	Phrase-level
i = 32	0.010	~628	Paragraph-level
i = 63	0.000115	~54,400	Document-level

The fastest pair completes a full rotation in about 6 positions. The slowest takes about 54,000. Over a 4096-token context, the slowest dim pair rotates through only about 27° of its full cycle.
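The period formula can be evaluated directly for any head dimension. The sketch below does so for $d_{\text{head}} = 128$; `rope_period` is a name made up for this example.

```python
import math

def rope_period(i, d, base=10000.0):
    """Positions needed for dimension pair i to complete one full rotation."""
    theta = base ** (-2 * i / d)
    return 2 * math.pi / theta

d_head = 128
for i in (0, 10, 32, 63):
    theta = 10000.0 ** (-2 * i / d_head)
    print(f"i={i:2d}  theta={theta:.6f}  period={rope_period(i, d_head):,.1f}")

# Fraction of a full turn the slowest pair covers over a 4096-token context.
slowest_degrees = 360.0 * 4096 / rope_period(d_head // 2 - 1, d_head)
print(f"slowest pair after 4096 positions: {slowest_degrees:.1f} degrees")
```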

High Frequency ($i$ near 0)

Rotates fast. Completes cycles in just a few positions. Captures short-range structure: neighboring tokens, bigrams, local syntax.

Low Frequency ($i$ near $d/2$)

Rotates slowly. Period spans thousands of positions. Captures long-range dependencies: document structure, coreference, global topic.

The Fourier Analogy

This is the same idea as a Fourier decomposition: high-frequency components carry rapid, local variation, low-frequency components carry slow, global structure. RoPE applies that multi-scale spectral structure to position, with the frequencies fixed by formula rather than learned from data.

Interactive: Frequencies

[Interactive figure: Frequency Bands. Different frequency bands rotate at different speeds as the position advances.]

5. Long Context Extrapolation

RoPE encodes relative position in the attention dot product. In principle, "token A is 5 positions before token B" should mean the same thing whether A is at position 10 or position 10,000. So why does a model trained on 4,096-token sequences degrade when given 8,192 tokens?

The Aliasing Problem

The high-frequency dimension pairs complete full $2\pi$ rotations every ~6 positions. For the fastest pair on its own, a relative distance of 7 wraps around and becomes indistinguishable from a relative distance of ~0.7; the slower pairs are what tell the two apart. This is fine during training, where the model sees all those distances. But at relative distances longer than anything seen in training, the model starts encountering combinations of rotation angles it was never trained on.

More precisely: the model learns attention patterns that map certain (rotation angle differences) to certain (how much to attend). Beyond training length, low-frequency dimension pairs have rotated to angles they have never encountered. The model has no learned behavior for those configurations.
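Both effects can be made concrete with a little arithmetic. The sketch below assumes $d_{\text{head}} = 128$, base 10000, and a 4096-token training window evaluated at 8192 tokens; `pair_angle` is a hypothetical helper.

```python
import math

def pair_angle(distance, i, d=128, base=10000.0):
    """Rotation angle of pair i at a relative distance, reduced to [0, 2*pi)."""
    theta = base ** (-2 * i / d)
    return (distance * theta) % (2 * math.pi)

# Fastest pair (i = 0, theta = 1): distance 7 wraps to 7 - 2*pi radians.
print(pair_angle(7, i=0))

# Slowest pair (i = 63): largest angle reachable inside and beyond the window.
train_len, test_len = 4096, 8192
theta_slowest = 10000.0 ** (-2 * 63 / 128)
max_seen = (train_len - 1) * theta_slowest
max_needed = (test_len - 1) * theta_slowest
print(max_seen, max_needed)   # the second angle lies outside the trained range
```

The slowest pair never completes a turn during training, so every angle past `max_seen` is new to the model.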

Absolute Embeddings

Position 8192 was never seen. The embedding is literally undefined or random. Quality collapses immediately.

Raw RoPE

Degrades more gracefully because relative position is still encoded, but quality typically drops beyond the training context length, with the onset and severity varying considerably across models.

The Core Advantage Over Absolute PE

RoPE still tends to extrapolate better than absolute embeddings because the relative distance structure is preserved. Many RoPE-based models retain useful performance somewhat beyond their training context length, though quality generally degrades with increasing extrapolation. Extensions like NTK-aware scaling and YaRN push usable context to 32K-128K, usually with a short fine-tuning run and only modest quality loss.

Interactive: Extrapolation

[Interactive figure: Long Context Extrapolation. Compares RoPE with absolute embeddings as the test length grows past the train length.]

RoPE Extensions

Three main techniques extend RoPE to longer contexts than the model was trained on. Each makes a different trade-off between local and global position resolution.

Position Interpolation (PI)

Proposed by Chen et al. (Meta, 2023). Instead of extrapolating to unseen positions, interpolate between seen ones. If trained on $L = 4096$ tokens and you want $L' = 32768$ tokens, scale every position index by $L/L'$ :

m' = m \cdot \frac{L}{L'}

(e.g., position 8192 maps to 1024)

All rotation angles stay within the training range. The cost is that every pair's angles are compressed by the same factor, including the high-frequency pairs: nearby tokens that previously had distinct high-frequency signals now look more similar to each other. Local structure gets blurred.

Requires ~1000 steps of fine-tuning on longer sequences to recover quality.
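In code, Position Interpolation amounts to rescaling the position index before the rotation angles are computed. A minimal sketch, with function names invented for this example:

```python
def interpolate_position(m, train_len, target_len):
    """Position Interpolation: m' = m * L / L' (fractional positions allowed)."""
    return m * train_len / target_len

def rope_angles(position, d, base=10000.0):
    """Rotation angle of every dimension pair at a (possibly fractional) position."""
    return [position * base ** (-2 * i / d) for i in range(d // 2)]

L, L_target = 4096, 32768
print(interpolate_position(8192, L, L_target))     # 1024.0
print(interpolate_position(32767, L, L_target))    # stays below 4096

# Neighbouring tokens are now only 1/8 of a position apart in every pair.
step = interpolate_position(1, L, L_target)
print(step, rope_angles(step, d=8)[0])
```

The last line shows the blurring effect: the fastest pair used to advance by 1 radian per token and now advances by 0.125.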

NTK-Aware Scaling

Discovered by u/bloc97 (2023), motivated by Neural Tangent Kernel theory. The key insight: instead of scaling positions, change the RoPE base from 10000 to a larger value:

b' = 10000 \cdot \alpha^{d/(d-2)}

where $\alpha = L'/L$

A larger base increases the period of every frequency band except the very first, whose $\theta_0 = 1$ does not depend on the base. Crucially, high-frequency dims change least (their period was already short and stays short). Low-frequency dims change most: the slowest pair's period is stretched by exactly $\alpha$, so it covers the extended context. This is the right direction: we want long-range dims to cover longer ranges without damaging short-range dims.

Works for moderate extensions (2-4x) without any fine-tuning.
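The effect of the base change on each frequency band can be seen by comparing the old and new frequencies pair by pair. This sketch assumes $d = 128$ and $\alpha = 4$; the function names are illustrative.

```python
def ntk_scaled_base(d, alpha, base=10000.0):
    """b' = base * alpha^(d / (d - 2)), with alpha = L' / L."""
    return base * alpha ** (d / (d - 2))

def frequencies(d, base):
    """theta_i = base^(-2i/d) for every dimension pair."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

d, alpha = 128, 4.0                       # e.g. extending 4096 -> 16384
old = frequencies(d, 10000.0)
new = frequencies(d, ntk_scaled_base(d, alpha))

# Ratio old/new = how much each pair's period was stretched.
stretch = [o / n for o, n in zip(old, new)]
print(stretch[0])     # fastest pair: 1.0 (unchanged)
print(stretch[32])    # middle pair: between 1 and alpha
print(stretch[-1])    # slowest pair: alpha, up to rounding
```

The exponent $d/(d-2)$ is what makes the slowest pair land on a stretch of exactly $\alpha$.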

YaRN (Yet Another RoPE extensioN)

Peng et al. (2023). Combines NTK-style scaling with frequency-band-aware interpolation and an attention temperature adjustment. YaRN treats dimension pairs differently depending on their wavelength relative to the original training length $L$:

Short period: Wavelength much smaller than $L$ (the pair completes many full rotations inside the training window): keep frequency unchanged. These dims are fine.

Mid period: Wavelength in between: blend smoothly between the unchanged frequency and the interpolated one.

Long period: Wavelength $\geq L$ (the pair never completes a full rotation during training): apply linear interpolation (PI-style), dividing the frequency by $s = L'/L$.

YaRN also rescales the attention logits by a factor $1/t$, with $\sqrt{1/t} = 0.1 \ln(s) + 1$, to counteract the increased attention entropy at longer contexts. In practice this is applied by multiplying both $q$ and $k$ by $0.1 \ln(s) + 1$. Achieves 64K-token context from a 4K-trained model with only ~400 fine-tuning steps.

Used by Nous Research's Yarn-Mistral 64K/128K, recommended as a scaling option in Qwen long-context releases, and adopted in many community Llama-based 32K-128K fine-tunes.
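The per-band treatment can be sketched as a ramp over the number of full rotations each pair completes within the original training length. This follows the formulation in the YaRN paper as best understood here; the thresholds `alpha=1` and `beta=32` are assumptions (values the paper suggests for LLaMA-family models), and the function names are hypothetical.

```python
import math

def yarn_frequencies(d, train_len, scale, base=10000.0, alpha=1.0, beta=32.0):
    """Blend unchanged and PI-scaled frequencies per dimension pair.

    r = train_len / wavelength counts the full turns a pair makes inside
    the original context. alpha and beta are ramp thresholds (assumed
    defaults, not values taken from this article).
    """
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        r = train_len * theta / (2 * math.pi)
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        # gamma = 1: keep theta; gamma = 0: full interpolation theta / scale.
        out.append((1.0 - gamma) * theta / scale + gamma * theta)
    return out

def yarn_qk_scale(scale):
    """Factor applied to both q and k; the logits are scaled by its square."""
    return 0.1 * math.log(scale) + 1.0

s = 16.0                                   # e.g. 4096 -> 65536
freqs = yarn_frequencies(d=128, train_len=4096, scale=s)
print(freqs[0])                            # fastest pair: unchanged
print(freqs[-1] * s)                       # slowest pair: original theta recovered
print(yarn_qk_scale(s))
```

Pairs between the two thresholds get a frequency strictly between the unchanged and fully interpolated values, which avoids a hard jump between bands.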

Extension Performance Comparison

Method	2K to 8K	2K to 32K	Fine-tuning?
Raw RoPE	Poor	Fails	N/A
PI	OK	Degrades	Required
NTK	Good	Moderate	Optional
YaRN	Excellent	Good	Optional (fast)

Models Using RoPE

RoPE has become the de facto standard for modern autoregressive language models. Almost every significant open-weight model released since 2023 uses it.

Uses Full RoPE

LLaMA 1, 2, 3 (Meta)
Mistral 7B, Mixtral 8x7B, 8x22B
Gemma / Gemma 2 (Google)
Qwen 1.5, 2, 2.5 (Alibaba)
DeepSeek V2, V3
GPT-NeoX / GPT-J (EleutherAI; rotary applied to only part of each head's dimensions)
OLMo (Allen AI)
Phi-3 (Microsoft)
CodeLlama (with the RoPE base raised to 1,000,000 for long-context fine-tuning)

Uses Other Approaches

GPT-3 (OpenAI): Learned positional embeddings. GPT-4's positional encoding mechanism has not been publicly disclosed.
T5 / Flan-T5: Relative position bias (T5-style)
BERT: Learned absolute embeddings (encoder only)
ALiBi models: Attention with Linear Biases (MPT, BLOOM variants)

Why the Entire Field Converged on RoPE

Three properties sealed RoPE's dominance. First, it is parameter-free: no extra weights are needed to represent position, unlike learned absolute embeddings. Second, relative position information emerges naturally from the geometry of the rotated query-key dot product, without any extra learned bias term. Third, it is extensible: NTK scaling and YaRN can push context length far beyond training without full retraining, which turned out to be essential as the field moved from 4K to 128K token contexts.
