Table of Contents

If you’ve used ChatGPT or any Transformer-based language model, KV Cache is why the response speed is tolerable. Without it, the model would need to recompute attention over all previous tokens every time it generates a new word — for long text, that’s quadratic compute blowing up fast. This article explains what KV Cache is, why it works, and how modern LLMs continue to push its limits within memory constraints.

TL;DR

KV Cache stores the Key and Value matrices for all previously seen tokens during Transformer self-attention. When generating each new token, only the current token’s Q needs to be computed; the model then reads from the cache to perform attention. This reduces the time complexity of autoregressive generation from O(n²) to O(n) per step — at the cost of memory growing linearly with sequence length, which becomes a GPU memory bottleneck for long-context inference.

Design Philosophy

Understanding KV Cache requires understanding what self-attention actually does.

For each token in the input sequence, self-attention computes three vectors:

  • Query (Q): “What am I looking for?”
  • Key (K): “What can I be found for?”
  • Value (V): “If someone finds me, what do they get?”

The attention score is the dot product of Q with each K, softmaxed, then used to weight-sum over V:

Attention(Q, K, V) = softmax(QK^T / √d_k) × V

In autoregressive generation, the model generates one token at a time. When generating token n, its Q must dot-product with the K of all n-1 previous tokens. Without caching, every new token forces a full recomputation of all previous K and V matrices — that’s where the O(n²) problem comes from.

The KV Cache solution is straightforward: keep the computed K and V around. Each new token only computes its own Q, K, V, appends the new K and V to the cache, then uses Q to attend over the full cache.

Core Concepts

Cache Structure

At token t generation time:
KV Cache = { K_1, K_2, ..., K_{t-1},
             V_1, V_2, ..., V_{t-1} }

Steps:
1. Compute Q_t, K_t, V_t for current token
2. Append K_t to cache → K_1...K_t
3. Append V_t to cache → V_1...V_t
4. Attention(Q_t, [K_1...K_t], [V_1...V_t]) → output

Each token’s computation is O(n) (dot-producting against all K/V in cache). The total computation for a sequence of length n is O(n²), but amortized per token it’s O(n), avoiding redundant recomputation.

Memory Footprint

KV Cache memory usage is:

KV Cache size = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_element

For Llama 3.1 8B (32 layers, 32 KQ heads, 128 dimensions, float16):

  • Per token: 2 × 32 × 32 × 128 × 2 bytes = 524,288 bytes ≈ 0.5 MB
  • Cache for 128K context: 128K × 0.5 MB = 64 GB

This is why KV Cache fills GPU memory for long-context inference. An A100 (80GB VRAM) running a 128K-context 8B model uses 64GB just for the cache — barely any room left for the model weights.

Prefill vs. Decode Phases

LLM inference has two distinct phases:

Prefill: The entire input prompt is processed at once, building the initial KV Cache. This phase is compute-intensive; GPU utilization is high.

Decode: Generates one token at a time, reading historical KV from cache. Compute demand is relatively low, but it’s bottlenecked by memory bandwidth (reading large caches from GPU HBM). For long sequences, the decode phase bottleneck is memory bandwidth, not compute.

Comparison with Alternatives

ApproachProsConsUse Case
Full KV CacheSimplest, fastestMemory grows linearly with sequenceShort to medium sequences
Multi-Query Attention (MQA)KV cache shrinks by n_headsRequires retrainingLatency-sensitive inference
Grouped-Query Attention (GQA)Balances MHA quality with MQA efficiencyAlso requires retrainingLlama 3, Mistral, modern models
Multi-head Latent Attention (MLA)Cache compressed to low-dim latent spaceExtra projection at inference timeDeepSeek V3
Sliding Window AttentionO(w×n) instead of O(n²)Loses long-range dependenciesSome Mistral layers
KV Cache EvictionCaps memory usageMay drop important tokensMorphKV and research

GQA (Grouped-Query Attention) is the most widely deployed optimization today. In Llama 3.1, the 32 Query heads are split into 8 groups that each share a single KV pair — reducing KV Cache size to 1/4 with minimal quality loss.

MLA (Multi-head Latent Attention) is DeepSeek V2/V3’s innovation. It projects K and V into a low-dimensional latent vector, caches the compressed representation, and expands at inference time. In theory this can reduce cache size by 5–13×, but implementation complexity is higher.

When KV Cache Helps (and When It Doesn’t)

KV Cache at its most effective:

  • Long conversations (multi-turn Q&A where each turn includes full history)
  • RAG pipelines (analyzing documents after retrieval)
  • Batch inference where multiple requests share a prefix (prefix caching optimization applies)

Where KV Cache hits its limits:

  • Extremely long text (100K+ tokens): cache alone exceeds GPU memory
  • High-concurrency serving (many simultaneous users): each session needs its own cache
  • Edge deployment (extremely memory-constrained): cache size requirements hard to satisfy

The Bottom Line

KV Cache is the foundational technology that makes modern LLM inference viable. Without it, every ChatGPT response would be orders of magnitude slower. But it’s also the key bottleneck limiting long-context LLM deployment at scale.

The research direction in 2025 focuses on three areas: compression (GQA, MLA), eviction (intelligently dropping unimportant token caches), and distributed caching (PagedAttention, vLLM’s cache management). Progress in these areas directly determines the cost and latency envelope of LLM serving.

References

🇺🇸 English

Here's the script for KV Cache: The Most Critical Optimization in LLM Inference.

---

Every time you use ChatGPT and it actually responds at a reasonable speed, there's one piece of engineering doing most of the heavy lifting. It's called KV Cache. And once you understand it, you'll never look at a language model the same way.

Let's start with the problem it solves.

Transformer-based language models generate text one token at a time. A token is roughly a word or part of a word. When the model generates each new token, it runs a process called self-attention — which is how the model figures out which parts of the conversation so far are relevant to what comes next.

Here's how self-attention works at a high level. For every token in the sequence, the model computes three things: a Query, a Key, and a Value. Think of it like a search engine built into every word. The Query is asking "what am I looking for?" The Key is saying "here's what I can be found for." And the Value is "if you find me, here's what you actually get."

To generate each new token, the model takes its Query and compares it against the Keys of every single previous token, then uses those scores to pull a weighted blend from all the Values. That's how context flows through the model.

Now here's the problem. Without any optimization, every time the model generates one new token, it has to recompute all of those Keys and Values for every previous token from scratch. If your conversation is a hundred tokens long, you're doing a hundred units of work for token 101. For token 102, you're doing a hundred and one units of work. And so on. That compounds into a quadratic blowup — the longer the text, the dramatically worse the compute gets.

KV Cache cuts through this entirely. The insight is simple: once you've computed the Key and Value for a token, that information doesn't change. The token is in the past. So why throw it away? Just keep it.

With KV Cache enabled, the model stores every Key and Value it has ever computed in a running cache. When it's time to generate the next token, it only needs to compute the Query, Key, and Value for that single new token — then it reads the full history of Keys and Values straight from the cache. No recomputation. The cost per token drops from growing-with-context to roughly constant.

This is what takes the complexity from quadratic down to linear per step. That's not a minor improvement. That's the difference between "unusable for anything longer than a paragraph" and "functional in production."

Now, nothing comes free. The tradeoff here is memory. That cache has to live somewhere, and it lives on GPU memory — which is expensive and limited.

Let's put some real numbers on this. Take Llama 3.1 8B, one of the popular open models. It has 32 layers, each with 32 attention heads and 128 dimensions per head, storing data in 16-bit floats. Per token, the KV Cache consumes about half a megabyte. That sounds small, but scale it to a 128-thousand-token context window and you're looking at 64 gigabytes of cache — for one session. An A100 GPU with 80 gigabytes of memory would be nearly maxed out on cache alone before you've even loaded the model weights.

This is why long-context inference is hard. It's not the computation — it's the memory.

There's also a distinction worth understanding between two phases of inference. The first phase is called prefill. This is when the model processes your entire input prompt all at once, building the initial cache. It's compute-heavy, the GPU is working flat out, and it happens fast relative to the output.

The second phase is called decode. This is where the model generates tokens one by one, reading from that cache each time. Interestingly, the compute demand here is actually pretty light — the bottleneck becomes memory bandwidth. Reading a giant cache off GPU memory takes time, and for long sequences, that's where the latency lives.

So the active research question becomes: how do we shrink the cache without breaking the model?

A few approaches have gotten real traction. Grouped-Query Attention, or GQA, is probably the most widely deployed today. The idea is that instead of every attention head having its own separate Key and Value, you group the heads together so that a cluster of them shares a single KV pair. Llama 3 and Mistral both use this. In Llama 3.1 specifically, 32 Query heads are organized into 8 groups — so the KV Cache shrinks to one-quarter the size, with barely any quality loss. That's a meaningful win in a memory-constrained world.

DeepSeek took a different approach with something called Multi-head Latent Attention. Instead of caching the full Keys and Values, you compress them down into a much smaller latent representation, cache that, and then expand it back out at inference time. The theory is a five-to-thirteen times reduction in cache size. The implementation is more complex, but DeepSeek V2 and V3 ship with this, and it's part of why those models punch above their weight at deployment scale.

There's also work on cache eviction — the idea that not every token in your context is equally important. You can intelligently drop Keys and Values for tokens that seem less relevant, capping memory usage without necessarily losing critical information. This is still an active research area.

Let me land on three things worth taking away from all of this.

First, KV Cache is the reason LLMs are deployable at all. Without it, the compute cost of autoregressive generation would be entirely impractical for anything beyond toy examples.

Second, memory is now the core constraint, not compute. The research frontier in 2025 is fundamentally about managing, compressing, and intelligently evicting that cache — because whoever solves this most elegantly wins on cost and latency at scale.

And third, the architectural choices baked into a model — things like GQA or MLA — have direct operational consequences. When you see a model described as efficient for long-context inference, it almost always traces back to what it's doing with the KV Cache.

---

🇹🇼 中文

如果你用過 ChatGPT,有沒有想過——為什麼它能那麼快生成文字?每個字幾乎是即時冒出來,而不是等個幾秒鐘?背後有一個關鍵技術叫 KV Cache,今天來聊聊它。

先從問題開始說起。Transformer 模型在生成文字時,是一個字、一個字地輸出的,這叫自回歸生成。生成第 N 個字的時候,模型需要「回頭看」前面所有字的關係——用注意力機制來決定現在應該輸出什麼。

注意力機制的核心是三個東西:Query、Key、Value。你可以把它想成一個查詢系統:Query 是「我現在想找什麼」,Key 是「每個位置有什麼可以被查」,Value 是「查到之後拿走什麼資訊」。模型把當前的 Query 跟所有位置的 Key 做比對,算出相似度,再加權提取 Value。

問題來了——如果不做任何優化,每生成一個新字,前面所有字的 Key 和 Value 都要重新算一遍。序列越長,重複計算越多,複雜度是 O(n²)。100 個字的時候還好,10000 個字的時候就災難了。

KV Cache 的解法非常直觀:算過的就存起來,下次直接讀。每個新 token 只需要計算自己的 Query、Key、Value,然後把新的 Key 和 Value 追加到快取,用自己的 Query 去查整個快取就好。每一步的計算量從重算整個序列,降到只算當前這一個 token——整體複雜度從 O(n²) 降到 O(n)。

這個優化聽起來簡單,但效果是決定性的。沒有 KV Cache,你在 ChatGPT 上等的不是秒,是分鐘。

當然,天下沒有白吃的午餐。快取要佔記憶體,而且隨著序列長度線性增長。算個具體數字:以 Llama 3.1 8B 這個模型為例,32 層、每層 32 個注意力頭,每個 token 大概要吃掉半個 MB 的記憶體。如果你跑 128K 長度的上下文,光 KV Cache 就要 64GB——一張 A100 顯卡的記憶體幾乎全用完了,還沒算模型本身的權重。

這就是為什麼長文本推論是個大難題。

推論過程分兩個階段。第一個叫 Prefill,把整個輸入的提示詞一次性計算,建立初始快取,這個階段 GPU 在全力跑計算。第二個叫 Decode,就是一個字一個字生成的過程,每步計算量不大,但需要不停從 GPU 記憶體讀取大量快取資料——瓶頸從算力變成了記憶體頻寬。

針對記憶體問題,現在有幾個主流的優化方向。

最廣泛採用的是 GQA,也就是 Grouped-Query Attention,Llama 3、Mistral 都在用。它的思路是:不是每個注意力頭都需要自己獨立的 Key 和 Value,把多個 Query head 分組,每組共用一套 KV。Llama 3.1 用 8 組,KV Cache 縮小到原來的四分之一,但效能損失很小。

另一個更激進的方案是 DeepSeek V2 和 V3 用的 MLA,把 Key 和 Value 壓縮成低維的隱向量存起來,需要的時候再展開。理論上快取可以壓縮 5 到 13 倍,代價是實作複雜度更高。

還有一類方向是快取淘汰——智慧地判斷哪些 token 的快取不那麼重要,主動丟掉,控制記憶體上限。這個方向目前還在研究階段。

最後總結三個核心點。

第一,KV Cache 是把自回歸生成的重複計算砍掉,讓每一步只算當前 token,這是現代 LLM 推論速度可接受的根本原因。

第二,它的代價是記憶體隨序列長度線性增長,長文本場景下這是真實的 GPU 記憶體瓶頸,不是理論上的問題。

第三,現在的研究方向主要是三條路:壓縮 KV 的大小(GQA、MLA)、智慧淘汰不重要的快取、以及更好的分散式快取管理。這些技術的進展,直接決定了 LLM 服務能跑多長的上下文、能同時服務多少用戶,以及成本能壓到多低。

Tags

Related Articles