Feed “cat bites dog” and “dog bites cat” into a Transformer — without positional information, these two sentences are identical to the model: just the tokens “cat,” “bites,” “dog” in some order. Self-attention lets each token attend to all others, but that “fully connected” design loses the concept of sequence order entirely. Positional Encoding was introduced in the original Transformer paper as the fix, but from 2017 to today, the solution to this problem has evolved through several generations.

TL;DR

  • Sinusoidal absolute positional encoding (original Transformer): computes position vectors using sine/cosine functions, no training needed, but extrapolates poorly beyond the training sequence length
  • Learnable absolute positional encoding (GPT-2, BERT): trains position vectors as parameters, which adds some flexibility but cannot extrapolate at all past the table size
  • Relative positional encoding (T5, ALiBi): attention directly perceives the relative distance between tokens, which works better for long sequences
  • RoPE (LLaMA, Mistral, Qwen, DeepSeek, most modern LLMs): multiplies positional information into Query and Key using rotation matrices — parameter-free, naturally encodes relative distance, extendable via YaRN and similar techniques — currently the dominant approach

The Problem

Why Positional Encoding Is Needed

The self-attention computation is:

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

This computation is permutation-equivariant with respect to token order (often loosely described as permutation-invariant). Shuffle the input sequence and each token’s output vector simply moves with it; the values don’t change. That’s fine for image patch classification or set problems, but in language, word order carries enormous semantic information.
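The shuffling claim can be checked numerically. The following is a minimal NumPy sketch, assuming a single attention head with no mask and random stand-in embeddings; the names `self_attention`, `Wq`, `Wk`, and `Wv` are illustrative and not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); single head, no mask, no positional encoding.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8                      # e.g. "cat", "bites", "dog"
X = rng.normal(size=(seq_len, d_model))      # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

perm = np.array([2, 1, 0])                   # "dog", "bites", "cat"
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Each token gets the same output vector; only the row order changes.
same_up_to_order = np.allclose(out[perm], out_shuffled)
print(same_up_to_order)
```

Because the output for “cat” is the same vector in both sentences, nothing downstream can tell which sentence it came from unless position is injected somewhere.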

Positional encoding’s task: inject position information into token representations without modifying the attention mechanism itself.

How Each Approach Works

Approach 1: Sinusoidal Absolute Positional Encoding (Vaswani et al., 2017)

The original Transformer paper’s method: for each position pos, each dimension i, compute:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

This vector is added directly to the token embedding, and the position information is carried implicitly through all subsequent computations.

Intuition: Different dimensions use sine waves of different frequencies, analogous to binary counting — low-frequency dimensions capture coarse position (is this in the first half or second half?), high-frequency dimensions capture fine position (which exact slot?).
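As a rough illustration of the two formulas above, here is a minimal NumPy sketch. It assumes `d_model` is even; the function name `sinusoidal_pe` and the zero-valued placeholder embeddings are made up for this example.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model, base=10000.0):
    # Assumes d_model is even so sine/cosine dimensions pair up.
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # the "2i" in the formula
    angles = pos / base ** (two_i / d_model)     # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)
token_embeddings = np.zeros((128, 64))           # placeholder embeddings
x = token_embeddings + pe                        # PE is added, not concatenated
```

In this layout the first dimension pairs oscillate fastest and the last pairs slowest, which is the multi-frequency structure the intuition describes.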

Downside: The formula can produce a vector for any position, but training only sees sequences up to a certain length. At inference time, positions beyond that length yield vectors the model never learned to interpret, and performance drops sharply.

Approach 2: Learnable Absolute Positional Encoding (GPT-2, BERT)

Instead of formulas, build a max_seq_len × d_model embedding table where each position’s vector is trained via backpropagation. BERT and GPT-2 both use this.

Advantage: The model can learn position representations suited to the task.

Disadvantages:

  1. Increases parameter count
  2. Still can’t extrapolate: the table has no vectors beyond its last position (512 for BERT, 1024 for GPT-2)
  3. The “relative relationship” between position 1 and position 2 isn’t explicitly modeled; the model has to learn it
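The lookup table and its hard length limit can be sketched as follows. This is an illustrative NumPy example, not BERT's or GPT-2's actual code: in a real model the table is a trainable parameter, whereas here it is only randomly initialised, and `add_learned_positions` is a hypothetical helper.

```python
import numpy as np

max_seq_len, d_model = 512, 64
rng = np.random.default_rng(0)

# In a real model this table is a trainable parameter updated by backprop;
# here it is only randomly initialised to show the lookup.
pos_table = rng.normal(scale=0.02, size=(max_seq_len, d_model))

def add_learned_positions(token_embeddings):
    seq_len = token_embeddings.shape[0]
    if seq_len > max_seq_len:
        # There is simply no row for positions >= max_seq_len.
        raise ValueError(f"sequence length {seq_len} exceeds table size {max_seq_len}")
    return token_embeddings + pos_table[:seq_len]

ok = add_learned_positions(np.zeros((512, d_model)))
try:
    add_learned_positions(np.zeros((513, d_model)))
    too_long_failed = False
except ValueError:
    too_long_failed = True
```

The table contributes `max_seq_len × d_model` extra parameters (32,768 in this toy configuration), and a sequence of 513 tokens cannot be encoded at all.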

Approach 3: Relative Positional Encoding

T5’s approach (Raffel et al., 2020, building on Shaw et al., 2018): Instead of adding position to embeddings, add a learned relative position bias directly to the attention logits for each (query token, key token) pair. The attention scores themselves then carry relative distance information.
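A simplified sketch of a relative position bias is shown below. It is an assumption-laden illustration: it uses one scalar per clipped offset for a single head, whereas T5 itself groups offsets into buckets; `bias_table` and `max_dist` are hypothetical names, and the table would be trained in a real model.

```python
import numpy as np

seq_len, max_dist = 6, 4
rng = np.random.default_rng(0)

# One scalar per relative offset in [-max_dist, max_dist]; trainable in a real
# model, random here. (Hypothetical single-head table for illustration.)
bias_table = rng.normal(size=2 * max_dist + 1)

pos = np.arange(seq_len)
rel = pos[None, :] - pos[:, None]                 # rel[i, j] = j - i
rel_clipped = np.clip(rel, -max_dist, max_dist)   # far offsets share one entry
bias = bias_table[rel_clipped + max_dist]         # (seq_len, seq_len)

scores = rng.normal(size=(seq_len, seq_len))      # stand-in for QK^T / sqrt(d_k)
scores_with_bias = scores + bias                  # bias enters before softmax
```

Every pair of tokens with the same offset shares one bias value regardless of where the pair sits in the sequence, which is what makes the encoding relative.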

ALiBi (Press et al., 2021): A cleaner relative encoding — for each attention head, add a negative linear bias proportional to relative distance directly to the attention logit. No extra parameters needed; more distant tokens get a larger negative penalty (effectively decaying). ALiBi performs relatively robustly when extrapolating to longer sequences.
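The ALiBi bias can be sketched in a few lines. This example assumes a causal setting and a power-of-two number of heads, for which the paper's slopes form a geometric sequence; the function names `alibi_slopes` and `alibi_bias` are illustrative only.

```python
import numpy as np

def alibi_slopes(n_heads):
    # Geometric sequence for power-of-two head counts:
    # with 8 heads the slopes are 1/2, 1/4, ..., 1/256.
    start = 2.0 ** (-8.0 / n_heads)
    return start ** np.arange(1, n_heads + 1)

def alibi_bias(seq_len, n_heads):
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]            # dist[i, j] = i - j (query - key)
    dist = np.maximum(dist, 0)                    # causal use: only j <= i matters
    # Shape (n_heads, seq_len, seq_len); added to the logits before softmax.
    return -alibi_slopes(n_heads)[:, None, None] * dist

bias = alibi_bias(seq_len=5, n_heads=8)
```

Heads with large slopes focus on nearby tokens, while heads with small slopes keep a longer reach. Since the penalty is a fixed function of distance, it is defined for any sequence length.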

Approach 4: RoPE, Rotary Position Embedding (Su et al., 2021)

RoPE is the most widely adopted positional encoding scheme today, used by LLaMA, Mistral, Qwen, DeepSeek, PaLM, and nearly all modern LLMs.

Core idea: Multiply positional information into Query and Key vectors, rather than adding it to token embeddings. Done via rotation matrices:

For a token at position m, rotate each pair of dimensions (q_{2i}, q_{2i+1}) of its Q vector:

[q_{2i}' ]   [cos(mθ_i)  -sin(mθ_i)] [q_{2i}  ]
[q_{2i+1}'] = [sin(mθ_i)   cos(mθ_i)] [q_{2i+1}]

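where θ_i = 10000^(-2i/d) and d is the dimension of the vector being rotated (the per-head dimension in multi-head attention). The frequency design is similar to sinusoidal encoding.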

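Why does this work? Write f(q, m) for the vector q rotated to position m. The dot product of the rotated Q at position m with the rotated K at position n is then:

f(q, m)^T · f(k, n) = g(q, k, m - n)

for some function g, because the two rotations compose into a single rotation by (n - m)θ_i in each dimension pair.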

The dot product result depends only on the relative position m-n, not on the absolute position. This naturally encodes relative distance into the attention computation without modifying the attention formula itself.
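The relative-position property can be verified numerically. Below is a minimal NumPy sketch of the pairwise rotation, assuming a single query/key vector of even length and the interleaved pairing (x_{2i}, x_{2i+1}) used in the formula above; the function name `rope` is illustrative, and production implementations often lay out the pairs differently.

```python
import numpy as np

def rope(x, position, base=10000.0):
    # x: one query or key vector of even length d (the per-head dimension).
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)     # theta_i = base^(-2i/d)
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]              # the pairs (x_2i, x_2i+1)
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin        # first row of the 2x2 rotation
    out[1::2] = x_even * sin + x_odd * cos        # second row
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same offset (m - n = 3) at two very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 1005) @ rope(k, 1002)
# A different offset gives a different score in general.
score_other = rope(q, 5) @ rope(k, 4)
```

Here `score_near` and `score_far` agree up to floating-point error, while `score_other` does not. The rotation also leaves each vector's norm unchanged, so only the angle between Q and K carries position.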

RoPE’s engineering advantages:

  • Parameter-free: No additional learnable parameters
  • Naturally encodes relative distance: Dot product value depends only on relative position
  • Extendable: With techniques like YaRN (Yet another RoPE extensioN) and Positional Interpolation, the trained context window can be extended severalfold. Llama 3.1, for example, combines RoPE with long-context fine-tuning to reach 128K context

               Absolute Positional Encoding
               ┌──────────────────────────────┐
               │ Sinusoidal (additive)        │ ← original Transformer
               │ Learnable embedding (additive)│ ← BERT, GPT-2
               └──────────────────────────────┘

               Relative Positional Encoding
               ┌──────────────────────────────┐
               │ T5 Bias (attention)          │ ← T5
               │ ALiBi (linear decay)         │ ← BLOOM, MPT
               └──────────────────────────────┘

               Rotary Encoding (multiplicative)
               ┌──────────────────────────────┐
               │ RoPE                         │ ← LLaMA, Mistral,
               │                              │   Qwen, DeepSeek
               └──────────────────────────────┘
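The “extendable” property of RoPE can be made concrete with the simplest of the extension techniques mentioned earlier, Positional Interpolation, which rescales positions so that a longer sequence maps back into the trained range. The sketch below is only an illustration under assumed lengths (4096 trained, 16384 target); `rope_angles` and its `scale` parameter are hypothetical names, and YaRN is more selective about which frequencies it rescales.

```python
import numpy as np

def rope_angles(position, d, base=10000.0, scale=1.0):
    # scale < 1 compresses positions (Positional Interpolation style).
    theta = base ** (-np.arange(0, d, 2) / d)
    return (position * scale) * theta

train_len, target_len, d = 4096, 16384, 64      # hypothetical lengths
scale = train_len / target_len                  # 0.25

# Position 16000 was never seen in training, but after scaling it produces
# exactly the rotation angles of position 4000, which was.
angles_long = rope_angles(16000, d, scale=scale)
angles_seen = rope_angles(4000, d)
```

The trade-off is that neighbouring positions end up closer together in angle than during training, which is one reason such methods are usually paired with some fine-tuning.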

What About No Positional Encoding?

Some 2023 research explored whether Transformers without positional encoding could work. The reported conclusion: for specific tasks with few tokens (such as classification), models can infer position implicitly from causal masking, because each token attends to a different number of predecessors. But for language generation, models without positional encoding were reported to have higher training loss and worse generation quality. Non-Transformer architectures like Mamba and RWKV encode position implicitly through SSM (State Space Model) or RNN time steps, which is a different path.

Summary

Scheme              Parameters  Extrapolation     Relative Distance   Modern LLM Adoption
------------------  ----------  ----------------  ------------------  ----------------------
Sinusoidal          None        Poor              Indirect            Rare
Learnable absolute  Yes         Poor              Indirect            Rare (BERT era)
T5 Bias             Few         Medium            Direct              T5 family
ALiBi               None        Good              Direct (linear)     BLOOM, MPT
RoPE                None        Good (with help)  Direct (rotation)   LLaMA, Mistral, Qwen…

RoPE’s dominance isn’t accidental — it simultaneously satisfies “parameter-free,” “relative distance,” and “extendable,” and has been validated across a large number of LLM training runs. Understanding RoPE’s mathematical principle also helps explain why long-context extrapolation techniques like YaRN work: fundamentally, it’s adjusting θ frequencies so the model acts as if it’s still within its training position range.

🇺🇸 English

Here's a question that cuts right to the heart of how language models work: if you feed a Transformer the sentence "cat bites dog" versus "dog bites cat," does it know the difference? Without positional encoding — no. Those two sentences are literally identical to the model. Just a bag of tokens floating in space, with no concept of who comes first.

That's the core problem. Self-attention on its own is what researchers call permutation-equivariant: shuffle the input tokens and the outputs are just shuffled the same way. Every token gets to look at every other token, but that "everyone talks to everyone" design throws sequence order straight out the window. So the question becomes: how do you put order back in? And the answer has evolved dramatically from 2017 to today.

Let's walk through the four generations.

The first approach came from the original Transformer paper itself, back in 2017. The idea was elegant: for each position in a sequence, compute a unique vector using sine and cosine waves at different frequencies. Think of it like binary counting — some dimensions oscillate slowly, giving you a coarse sense of "am I in the first half of the sentence or the second half?" Other dimensions oscillate fast, giving you fine-grained "exactly which slot am I in?" You add this vector directly to the token's embedding before anything else happens. No training required, no extra parameters. The math just generates it.

The catch? The model only ever sees sequences up to a certain length during training. Push past that limit at inference time, and you're in uncharted territory — performance falls off a cliff.

The second approach, used by BERT and GPT-2, said: forget the formula, let's just learn the position vectors. You build a lookup table — one vector per position slot — and train it end-to-end alongside everything else. The model can learn whatever position representation actually helps the task. More flexible, in theory. But it still has that hard wall at the maximum training length. Position 513 simply doesn't exist if you trained up to 512. And the model has to implicitly figure out that "position 1 to position 2" is the same relationship as "position 50 to position 51" — nothing in the architecture makes that explicit.

The third generation shifted the philosophy entirely. Instead of injecting position into the token embeddings before attention, why not inject it directly into the attention computation itself? T5 popularized this, building on earlier relative-position work, by adding a learned bias to each attention score based on the relative distance between two tokens. Token at position 5 attending to token at position 2? That's a distance of 3, and it gets its own learned adjustment. ALiBi took this further and made it even cleaner: no learned parameters at all, just a fixed negative penalty proportional to distance. The further apart two tokens are, the more you suppress that attention connection. Simple, interpretable, and it handles longer sequences more gracefully because relative distances stay meaningful even as the sequence grows.

Then came RoPE, and it's basically where everyone landed.

RoPE — Rotary Position Embedding — is used by LLaMA, Mistral, Qwen, DeepSeek, and the vast majority of serious modern language models. The core insight is subtle but powerful: instead of adding position information to embeddings, you rotate it in.

Here's the intuition. Take the Query and Key vectors that attention uses. For a token at position m, you rotate each pair of dimensions in its Query vector by an angle proportional to m. You do the same for the Key vector at position n. Now when you compute the dot product between them — which is what attention scores are — something beautiful happens mathematically: the result depends only on the *difference* between m and n, not on their absolute values. The absolute positions cancel out. What remains is the relative distance.

That's the magic. You get relative position awareness for free, as a mathematical consequence of how rotation works. No extra parameters. No modifications to the attention formula itself. And because the frequency design mirrors the sinusoidal approach, you get the same multi-scale structure — different dimension pairs rotate at different speeds, encoding position at different levels of granularity.

And it extends. Techniques like YaRN ("Yet another RoPE extensioN") can stretch a model's context window by adjusting those rotation frequencies, essentially making the model behave as if it's still operating within its training range even when you feed it a much longer sequence. Llama 3.1 combined RoPE with long-context fine-tuning to reach a 128,000-token context window.

So let's zoom out. You've got three distinct eras here. First: absolute positional encoding — sinusoidal or learned — where you stamp each token with its absolute address. Useful, but fragile at long range. Second: relative positional encoding — T5 and ALiBi — where attention itself perceives distance, which handles extrapolation better. Third: rotary encoding — RoPE — where positional information is multiplied into the Query-Key computation through rotation, giving you relative distance, zero parameters, and extensibility all at once.

RoPE's dominance isn't hype. It checks every box simultaneously: you don't pay parameter cost for it, it natively captures relative distance rather than just absolute position, and you can extend it with principled math rather than hoping the model generalizes. When you see researchers talking about long-context fine-tuning or context window extension, they're almost always building on top of RoPE's foundation.

Three things to take away. One: word order is not built into Transformer attention — it has to be explicitly added, and the method you choose matters significantly. Two: the field moved from "add position to embeddings" to "encode relative distance in the attention computation itself," and that shift is why modern models handle long sequences so much better. Three: RoPE won because it combines parameter efficiency, mathematical elegance, and practical extensibility — and understanding *why* the rotation trick works is the key to understanding all the long-context techniques built on top of it.

🇹🇼 Chinese

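"Cat bites dog" and "dog bites cat" are two completely different events to a human. But feed both sentences into a Transformer with no positional information at all, and the model sees the same thing: just the three tokens "cat," "bites," and "dog," with no notion of order whatsoever.

This is a fundamental problem with the Transformer architecture. Self-attention lets every token attend directly to every other token, which is very powerful, but that same "fully connected" property makes it completely insensitive to the order of its inputs. Shuffle the tokens and each token's output vector is simply rearranged; the values do not change at all. In technical terms, this is permutation equivariance.

So ever since the original Transformer paper in 2017, researchers have had to find ways to put position information back in. The solutions developed since then fall roughly into four generations.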

---

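The first generation is **sinusoidal absolute positional encoding**, the method from the original paper. The concept is simple: for each position, compute a vector from sine and cosine waves of different frequencies, then add it directly to the token's embedding. Intuitively, you are describing a position with waves of different frequencies: the low frequencies distinguish "roughly the first half or the second half," while the high frequencies pin down "exactly which position," somewhat like binary counting.

The upside is that it needs no additional trained parameters; the formula computes everything directly. The downside is just as clear: training only ever sees sequences up to a fixed length, so once inference goes beyond that length, the model encounters position vectors it has never seen and performance drops sharply.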

---

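The second generation is **learnable absolute positional encoding**, the method used by both GPT-2 and BERT. Instead of a fixed formula, you build a lookup table of position vectors, one vector per position, and learn it through training.

This gives the model more flexibility to learn position representations suited to the task. But the underlying problem remains: the lookup table has a fixed size. However many positions were set at training time is all there is, nothing exists beyond them, and the model still cannot extrapolate. And the "relationship between position 1 and position 2" is something the model has to learn from data on its own; it is not explicitly modeled.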

---

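The third generation moves to **relative positional encoding**, represented by T5 and ALiBi.

T5's approach is not to add position at the embedding layer, but to add a relative-position bias directly to each pair of tokens during the attention computation. The attention scores themselves then carry information about how far apart the two tokens are.

ALiBi is more radical, and also more elegant: for each attention head, a negative number proportional to the relative distance is added directly to the attention logit. The farther away a token is, the larger the negative penalty, which amounts to a decay with distance. It needs no extra parameters, and it holds up relatively well on longer sequences.

The core insight of relative positional encoding is that what really matters in language is often not "which position is this word in" but "how far apart are these two words."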

---

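The fourth generation, and currently the dominant approach, is **RoPE, Rotary Position Embedding**. LLaMA, Mistral, Qwen, DeepSeek: almost every modern large language model you can name uses it.

RoPE's core idea differs from the earlier generations. The absolute encodings "add" a position vector to the token embedding; RoPE instead "multiplies" position information in. Concretely, it applies rotation matrices to the Query and Key vectors.

Each pair of dimensions is rotated by a position-dependent angle, and different dimension pairs rotate at different frequencies, similar to the frequency design of sinusoidal encoding.

This has a very neat mathematical property: when you compute the dot product between the Query at position m and the Key at position n, the result depends only on the difference between the two positions, that is, the relative distance, and not on their absolute positions. Relative distance is thus encoded naturally inside the attention computation, with no change to the attention formula itself.

On the engineering side, RoPE offers three advantages at once. First, it is parameter-free: no additional learnable vectors are needed. Second, it naturally expresses relative distance. Third, techniques like YaRN can extend it to longer lengths; in essence they adjust the rotation frequencies so the model acts as if it is still within a familiar position range. Llama 3.1 uses RoPE plus long-context fine-tuning to bring its context window to 128K.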

---

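Looking back over the whole evolution, you can think of the differences between these generations along three dimensions: whether extra parameters are needed, whether the method extrapolates to longer sequences, and whether relative distance is explicitly modeled.

The early sinusoidal encodings and learnable embeddings have no extrapolation ability, and they encode relative distance only indirectly. T5 and ALiBi solved the relative-distance problem. RoPE satisfies all three requirements at once (parameter-free, direct encoding of relative distance, extendable to longer contexts) and has been repeatedly validated in a large number of real training runs.

That is why, when you open the architecture documentation of almost any mainstream open-source LLM today, you will see the name RoPE. Its win is no accident: it is more elegant mathematically and more practical in engineering.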
