
Since 2023, semiconductor markets have split. Consumer electronics chip demand has been soft; data center memory has been severely supply-constrained. The driver is AI—specifically, the fact that LLM inference is fundamentally memory-bandwidth-bound in a way that traditional computing workloads are not. This piece explains why.

TL;DR

LLM inference requires loading model weights and maintaining KV caches in high-bandwidth memory at all times. This makes HBM (High Bandwidth Memory) the rate-limiting component of AI accelerators. The memory market grew 78% in 2024; HBM supply is committed through 2026, and the upcycle is expected to run at least through 2028.

What HBM Is

HBM stacks multiple DRAM layers vertically using Through-Silicon Vias (TSV) and connects them to a GPU via a silicon interposer. The result is dramatically higher bandwidth and capacity than conventional GDDR memory, at the cost of significantly higher manufacturing complexity and price.

| | HBM3e | GDDR6 |
|---|---|---|
| Bandwidth (per GPU) | 1.2+ TB/s | 768 GB/s |
| Capacity (per GPU) | 80–192 GB | 24–48 GB |
| Power consumption | Lower (short trace lengths) | Higher |
| Cost | Significantly higher | Relatively lower |

NVIDIA’s H100 ships with 80 GB HBM3; the H200 uses HBM3e with up to 141 GB. These specs exist because LLM workloads demand them.

Why AI Creates Unusual Memory Requirements

LLM inference has a compute profile unlike gaming, scientific simulation, or traditional database queries.

Model weights must fit in memory: a 70B-parameter model at fp16 precision requires approximately 140 GB of memory. Those weights need to be accessible to GPU compute cores on every forward pass, which means they must live in HBM—swapping to CPU DRAM introduces latency that is prohibitive for real-time inference.
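The 140 GB figure is plain arithmetic: parameter count times bytes per parameter. The sketch below reproduces it and, as an illustration, counts how many accelerators are needed to hold the weights alone. The function names are hypothetical, and the 80 GB and 141 GB capacities are the H100 and H200 figures quoted earlier; KV cache and activations are deliberately ignored here.

```python
import math

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

def min_gpus_for_weights(num_params: float, bytes_per_param: float,
                         hbm_gb_per_gpu: float) -> int:
    """Smallest GPU count whose combined HBM can hold the weights (nothing else)."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / hbm_gb_per_gpu)

# 70B parameters at fp16 (2 bytes per parameter)
print(weight_memory_gb(70e9, 2))            # 140.0
print(min_gpus_for_weights(70e9, 2, 80))    # 2 (80 GB per GPU)
print(min_gpus_for_weights(70e9, 2, 141))   # 1 (141 GB per GPU, with about 1 GB to spare)
```

In practice the count is higher, because the same HBM must also hold the KV cache and activations.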

Inference is memory-bandwidth-bound, not compute-bound: during the decode phase (generating tokens one by one), every step re-reads the model's weight matrices, and the attention layers re-read the KV cache, to produce a single new token per sequence. Very little arithmetic is done per byte moved, so GPU compute cores spend most of their time waiting for data to arrive from HBM rather than performing calculations. Adding more compute cores does not help; adding more memory bandwidth does.
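A rough way to see the bandwidth ceiling is to divide memory bandwidth by the bytes that must be read per decode step. The sketch below is a simplified model, not a benchmark: it assumes each step reads all weights once (shared across the batch) plus each sequence's KV cache, and it ignores compute time, caching, and multi-GPU communication. The 1.2 TB/s figure is the illustrative value from the table above; the function name and parameters are hypothetical.

```python
def decode_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float,
                             kv_bytes_per_sequence: float = 0.0,
                             batch_size: int = 1) -> float:
    """Upper bound on decode throughput when every step is limited by HBM reads.

    Each decode step reads the weights once (shared by the whole batch) and
    each sequence's own KV cache, then emits one token per sequence.
    """
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_sequence
    steps_per_second = bandwidth_bytes_per_s / bytes_per_step
    return batch_size * steps_per_second

WEIGHT_BYTES = 70e9 * 2   # 70B parameters at fp16
BANDWIDTH = 1.2e12        # 1.2 TB/s, the illustrative figure from the table

single = decode_tokens_per_second(WEIGHT_BYTES, BANDWIDTH)
doubled = decode_tokens_per_second(WEIGHT_BYTES, 2 * BANDWIDTH)
print(f"{single:.2f} tokens/s at 1.2 TB/s, {doubled:.2f} tokens/s at 2.4 TB/s")
```

Under this model, doubling bandwidth doubles the ceiling, while no term for compute throughput appears at all, which is the point of the paragraph above.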

KV caches grow with context length: during generation, the model must cache key-value states for every token in the context. The cache grows linearly with the number of tokens, so if a 4K-context KV cache is a few gigabytes, a 128K-context cache is 32 times larger, reaching tens of gigabytes or more per sequence. Longer context windows, which users increasingly want, directly translate to more HBM pressure.
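The linear growth can be made concrete with the usual per-token accounting: one key and one value vector per layer per KV head. The configuration below (80 layers, 8 KV heads, head dimension 128, fp16 values) is an assumption chosen to resemble a 70B-class model with grouped-query attention; it is not taken from the article, and models with more KV heads will have proportionally larger caches.

```python
def kv_cache_bytes(context_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: keys + values, every layer, every token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens

# Hypothetical 70B-class configuration (assumed, not from the article)
cfg = dict(num_layers=80, num_kv_heads=8, head_dim=128)

short = kv_cache_bytes(4 * 1024, **cfg)
long = kv_cache_bytes(128 * 1024, **cfg)
print(f"4K context:   {short / 1e9:.2f} GB")
print(f"128K context: {long / 1e9:.2f} GB ({long // short}x larger)")
```

With these assumed dimensions a single 128K-token sequence needs roughly 43 GB of cache, which is why long-context serving competes directly with the weights for HBM capacity.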

graph LR
  subgraph "LLM Inference Memory Requirements"
    W["Model weights\n70B model ≈ 140 GB fp16"]
    KV["KV Cache\ngrows linearly with context"]
    Act["Activation memory\nrelatively small"]
  end

  subgraph "HBM's Role"
    BW["Provide bandwidth\nso GPU cores don't stall"]
    Cap["Provide capacity\nfor weights + cache"]
  end

  W --> Cap
  KV --> Cap
  W --> BW
  KV --> BW

Market Structure

Supply side: HBM production requires complex wafer bonding and TSV processes. Only three manufacturers have commercial capacity:

  • SK Hynix: approximately 62% market share, dominant supplier to NVIDIA (roughly 90% of NVIDIA’s HBM comes from SK Hynix)
  • Micron: approximately 21%
  • Samsung: approximately 17% (yield challenges have kept it behind)

HBM capacity across all three suppliers is essentially committed through the end of 2026.

Demand side: data center DRAM consumption has risen from 32% of global DRAM demand five years ago to approximately 50% in 2025, and is projected to exceed 60% by 2030.

Prices: overall DRAM prices rose 30–60% in 2024-2025; NAND flash prices approximately doubled over the same period.

Cycle duration: Micron projected in December 2025 that tight market conditions would persist beyond calendar 2026. Analyst projections put the HBM total addressable market at approximately $35 billion in 2025, growing to approximately $100 billion by 2028, which implies a compound annual growth rate (CAGR) of roughly 40%.
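The growth rate implied by those two endpoints can be checked with the standard CAGR formula. This is only a sanity check on the arithmetic, using the article's own figures; the function name is hypothetical.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1

# $35B in 2025 to $100B in 2028 is three years of growth
rate = cagr(35e9, 100e9, 3)
print(f"Implied CAGR: {rate:.1%}")
```

The result lands slightly above 40%, consistent with the rounded figure quoted above.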

Why This Cycle Differs from Previous Semiconductor Cycles

Previous cycles, like the 2021 COVID chip shortage, were demand shocks combined with supply-chain disruption. The current cycle has a different structure.

Demand is structural: hyperscaler AI infrastructure investment is multi-year capital expenditure, not consumer demand that snaps back after a shock. Google, Microsoft, Meta, and Amazon are signing multi-year procurement agreements for both GPUs and HBM.

Supply is genuinely constrained: HBM fab capacity takes two to three years to expand. Yield challenges at Samsung have kept effective supply growth slower than planned.

Demand has stickiness: even if AI investment moderates, deployed models still require HBM for inference. The installed base of AI accelerators creates persistent demand independent of new purchases.

Implications for Engineers

Cloud GPU costs remain elevated: HBM is a significant cost component of H100 and A100 hardware. GPU rental prices will stay high as long as HBM is constrained.

Model compression becomes economically rational: quantization (INT8, INT4), weight pruning, and knowledge distillation are not just academic exercises—they directly reduce HBM requirements, which reduces infrastructure cost.
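To illustrate why quantization reduces HBM requirements, here is a minimal sketch of symmetric per-tensor INT8 quantization in pure Python. It is a teaching example under simple assumptions (one scale for the whole tensor, no calibration, no outlier handling), not the scheme used by any particular library; the function names are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)          # one signed byte per weight instead of two bytes at fp16
print(restored)   # close to the originals, within half a quantization step

# Weight footprint of a 70B-parameter model at different precisions (GB)
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(name, 70e9 * bytes_per_param / 1e9)
```

Halving the bytes per parameter halves both the capacity needed and the bytes read per decode step, which is why the savings show up in cost as well as in fit.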

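Memory-efficient attention is a real engineering discipline: FlashAttention restructures the attention computation so that the full attention-score matrix is never written to HBM, which cuts memory traffic, while PagedAttention (vLLM) stores the KV cache in small fixed-size blocks to reduce fragmentation and over-reservation. Together with related techniques, they allow the same hardware to serve more concurrent requests or handle longer contexts.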
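The idea behind block-based KV cache management can be sketched with a toy allocator: instead of reserving a contiguous region sized for the maximum context, each sequence receives fixed-size blocks only as its tokens arrive. This is a simplified illustration of the general paging idea, not vLLM's implementation; the class name, method names, and the block size of 16 tokens are all assumptions.

```python
class PagedKVAllocator:
    """Toy block allocator: sequences get fixed-size KV blocks on demand."""

    def __init__(self, total_blocks: int, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[int, list[int]] = {}  # sequence id -> physical block ids
        self.lengths: dict[int, int] = {}             # sequence id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        length = self.lengths.get(seq_id, 0)
        if length % self.block_tokens == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# 64 blocks of 16 tokens = 1024 token slots in total
allocator = PagedKVAllocator(total_blocks=64, block_tokens=16)
for seq_id in range(9):
    for _ in range(100):          # each sequence actually uses 100 tokens
        allocator.append_token(seq_id)
print(len(allocator.block_tables), "sequences resident,",
      len(allocator.free_blocks), "blocks free")
```

With the same 1024 slots, reserving a contiguous 512-token region per sequence would admit only two sequences; allocating 7 blocks per 100-token sequence admits nine, with the waste limited to the unused tail of each sequence's last block.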

Summary

The AI memory supercycle is not hype. It follows directly from technical properties of LLM inference: the decode phase is memory-bandwidth-bound, model weights are large, and KV caches grow with context. HBM is not “faster RAM”; it is the enabler that makes large-scale LLM inference possible at all. How long this cycle lasts depends on the pace of AI infrastructure investment and the rate at which HBM production capacity can be expanded.


🇺🇸 English

Podcast Script

---

Something interesting happened to the semiconductor market starting around 2023. Consumer chip demand went soft — phones, laptops, the usual stuff — while data center memory became so scarce that the biggest buyers in the world couldn't get enough of it. The reason is AI, and it's not just "AI is hot" hype. There's a very specific technical reason why AI workloads are eating memory in a way nothing before them has.

Let me walk you through it.

When you run a large language model — say, a 70-billion-parameter model — just holding those weights in memory costs roughly 140 gigabytes at standard precision. And those weights can't live in some slow storage tier. They have to be right there, instantly accessible to the GPU compute cores on every single forward pass. If you start swapping to regular CPU memory, your latency blows up and the whole thing becomes useless for real-time inference.

But here's the really counterintuitive part: inference isn't bottlenecked by compute. It's bottlenecked by memory bandwidth.

During the decode phase, when the model is generating tokens one by one, every step re-reads the weight matrices, and the attention layers re-read the KV cache. The GPU cores are mostly *waiting* for data to show up from memory. They're not crunching numbers; they're standing around. You could throw more compute cores at this problem and it wouldn't help at all. What you need is memory that can move data faster.

And then there's the KV cache problem, which gets worse as context windows get longer. Every token in your context window requires the model to store key-value states. A four-thousand token context? A few gigabytes. A hundred-and-twenty-eight-thousand token context, which users increasingly want, is 32 times larger, because the cache grows linearly with context length. Longer context equals more memory pressure, full stop.

This is exactly why HBM exists.

High Bandwidth Memory stacks multiple DRAM layers vertically, connects them to the GPU via a silicon interposer, and delivers bandwidth and capacity that conventional graphics memory simply cannot match. Compare them directly: HBM3e on a modern GPU can push over a terabyte per second of bandwidth, versus around 768 gigabytes per second for GDDR6. And capacity-wise, you're looking at 80 to 192 gigabytes on HBM versus 24 to 48 on GDDR. The H100 ships with 80 gigs of HBM3. The H200 steps up to 141 gigs of HBM3e. Those aren't marketing specs — they're the minimum viable configuration for running large models in production.

Now here's where it becomes a market story.

Making HBM is genuinely hard. The vertical stacking and bonding processes are complex enough that only three companies in the world do it commercially: SK Hynix with about 62% of the market, Micron with around 21%, and Samsung with the remaining 17% — though Samsung has been struggling with yield issues that have kept them behind. SK Hynix alone supplies roughly 90% of NVIDIA's HBM. All three manufacturers have their capacity committed through the end of 2026. The memory market grew 78% in 2024. DRAM prices rose 30 to 60 percent. NAND roughly doubled.

What makes this cycle different from, say, the 2021 COVID chip shortage? That was a demand shock plus supply-chain chaos — painful, but temporary. This one has a different structure underneath.

Hyperscalers like Google, Microsoft, Meta, and Amazon are signing multi-year procurement contracts. This isn't consumer demand that snaps back after a shock; it's capital expenditure planned years in advance. And even if AI investment cools at some point, every GPU that's already been deployed still needs HBM to run inference. The installed base creates persistent demand regardless of new purchases. Analyst projections put the HBM addressable market at around 35 billion dollars in 2025, growing toward 100 billion by 2028. That's roughly a 40 percent compound annual growth rate over three years.

So what does this mean if you're an engineer working with these systems?

GPU rental costs are staying high for exactly this reason — HBM is a major cost component of every H100 and A100 in every cloud. As long as supply is constrained, prices stay elevated.

Model compression stops being academic. Quantization to INT8 or INT4, weight pruning, knowledge distillation — these aren't just clever tricks. They directly reduce how much HBM your model needs, which directly reduces your infrastructure bill. The economics are real.

And memory-efficient attention techniques are now a legitimate engineering discipline. FlashAttention cuts the memory traffic of the attention computation itself, and PagedAttention from vLLM stores the KV cache in small blocks so less HBM is wasted on fragmentation. Both let the same hardware serve more concurrent users or handle longer contexts. If you're not aware of these techniques, you're leaving performance on the table.

Three things to take away from all this.

First: LLM inference is memory-bandwidth-bound, not compute-bound. That single fact explains most of what's happening in the AI hardware market.

Second: HBM isn't "faster RAM" — it's the enabling technology that makes large-scale inference possible at all. Without it, the largest models simply don't run.

Third: this supercycle has structural legs that previous chip cycles didn't. Multi-year procurement contracts, genuinely constrained supply, and a persistent installed base of accelerators that keeps demanding the product. The current tight conditions are expected to run at least through 2026, with the broader cycle projected through 2028.

The memory story is, in many ways, the AI story.

---

🇹🇼 Chinese

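Since 2023, the semiconductor industry has seen an interesting split: chip demand for consumer electronics has been weak, while data center memory has been in short supply. The key driver is AI, or more precisely, the fact that large language models place memory demands during training and inference that are simply not on the same scale as traditional workloads.

First, what HBM is. HBM (High Bandwidth Memory) stacks multiple DRAM dies vertically, connects them with through-silicon vias, and places the stack right next to the GPU. Compared with conventional GDDR6, the gap is obvious: HBM3e bandwidth can exceed 1.2 TB/s, with capacities from 80 GB to 192 GB; GDDR6 delivers roughly 768 GB/s, with only 24 to 48 GB of capacity. And because the distances are shorter, HBM actually consumes less power. The price is that it costs far more. The NVIDIA H100 ships with 80 GB of HBM3, a spec that is in a different world from consumer graphics cards.

So why does AI place such unusual demands on memory? Three key points.

First, the model weights must fit in memory. A 70-billion-parameter model stored in half-precision floating point needs about 140 GB of memory. During inference the GPU cores must be able to read those weights at any moment; CPU memory is too slow, so they have to live in HBM.

Second, inference is bottlenecked by memory bandwidth, not compute. In the text-generation phase of a Transformer, the GPU cores spend most of their time waiting for memory to deliver data and are otherwise idle. This is called being memory-bound. In that situation, adding more compute cores does nothing; only raising memory bandwidth helps. This is the core reason HBM is irreplaceable.

Third, the KV cache. While generating a response, an LLM has to cache the key-value states of every preceding token, so the longer the context, the larger the cache. A 4K context already takes several gigabytes, and many models now support 128K or more, so memory requirements balloon.

Together, these three factors make HBM a basic precondition for AI accelerators to exist at all, not an optional upgrade.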

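Now the market structure. HBM is extremely hard to manufacture, and only three vendors worldwide can produce it in volume: SK Hynix, with about 62% share, is NVIDIA's main supplier; Micron has about 21%; and Samsung has about 17%, held back by persistent yield problems. About 90% of the HBM in NVIDIA's H100 and H200 comes from SK Hynix, and that capacity is essentially sold out in advance through 2026.

On the demand side, data centers' share of DRAM demand has risen from 32% five years ago to about 50% in 2025, and is projected to exceed 60% by 2030. Overall DRAM prices rose 30 to 60% between 2024 and 2025, and NAND flash prices rose by nearly 100%. Several analyst firms expect this upcycle to last until at least 2028, with the HBM market projected to grow from $35 billion in 2025 to about $100 billion in 2028, a compound annual growth rate of about 40%.

This cycle is very different from a traditional one like the 2021 COVID chip shortage. That was a one-off demand shock combined with supply-chain disruption. This time, Google, Microsoft, Meta, and Amazon are all signing multi-year procurement contracts, which is long-term capital expenditure; expanding capacity takes two to three years of fab construction, and Samsung's yield problems squeeze effective supply further. A structure like this does not rebalance in a few quarters.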

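For engineers building at the application layer, this cycle has several practical consequences. Cloud GPU prices will stay high, driven by the cost of HBM. Model compression techniques such as quantization, distillation, and sparsification are motivated in large part by the goal of running larger models, or more concurrent requests, within the same HBM capacity. FlashAttention reduces the memory traffic of attention, and vLLM's PagedAttention reduces wasted KV cache memory, so that limited hardware can serve more requests.

Three core takeaways. First, LLM inference is memory-bound rather than compute-bound, which makes HBM an irreplaceable, essential component. Second, HBM capacity is highly concentrated among a few suppliers and slow to expand, so the supply shortage will last until at least 2028. Third, this supercycle is driven by structural demand from AI infrastructure and has long-term effects on the cost structure of the whole industry; application-layer engineers also need to treat memory efficiency as a first-order concern.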
