
In December 2024, Chinese AI company DeepSeek published a technical report that made many people in the AI research community run the numbers twice: it had trained a 671B-parameter model using 2.788 million H800 GPU-hours, at a cost of approximately $5.576 million. By comparison, GPT-4’s training cost is estimated to exceed $100 million. Comparable performance, roughly one-twentieth the training cost, fully open source. The implications go beyond “cheap AI”: the result recalibrated the industry’s assumptions about training efficiency.

TL;DR

DeepSeek V3 is a 671B total-parameter MoE (Mixture of Experts) model that activates only 37B parameters per token. Through innovations including MLA (Multi-head Latent Attention), auxiliary-loss-free load balancing, and multi-token prediction, it was trained in 2.788M H800 GPU-hours at approximately $5.576M. It matches or approaches top closed-source models on multiple benchmarks. API pricing is approximately $0.028 per million input tokens, a small fraction of the $3 to $10 listed for the closed-source models in the comparison below.
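As a sanity check on the headline figure, the cost can be reproduced from the GPU-hours. Dividing the two quoted numbers implies a rental rate of about $2 per H800 GPU-hour; that rate is an inference from the totals, not a figure stated in this article. The function name is illustrative.

```python
# Back-of-envelope check of the headline training cost.
GPU_HOURS = 2.788e6          # H800 GPU-hours, as quoted in the text
RATE_USD_PER_GPU_HOUR = 2.0  # ASSUMPTION: rate implied by $5.576M / 2.788M hours


def training_cost_usd(gpu_hours: float, rate: float) -> float:
    """Return the total rental cost in US dollars."""
    return gpu_hours * rate


cost = training_cost_usd(GPU_HOURS, RATE_USD_PER_GPU_HOUR)
print(f"${cost / 1e6:.3f}M")
```

Note that a figure computed this way covers GPU rental for the training run only; it says nothing about staff, data, or earlier experiments.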

Design Philosophy

DeepSeek’s core question was: where is the efficiency ceiling for AI training?

The mainstream view held that frontier models require massive GPU clusters and astronomical budgets, and training costs at OpenAI, Google, and Anthropic appeared to climb steeply with each generation. DeepSeek took the opposite approach, asking “what can architecture design achieve within a fixed compute budget?”

This thinking shows up in several concrete decisions:

  1. Choose MoE over dense: MoE gives you large parameter count (strong expressiveness) without activating everything during inference (less compute)
  2. Optimize for hardware available in China: The H800 is the export-compliant variant of the H100, with reduced chip-to-chip interconnect (NVLink) bandwidth. DeepSeek had to optimize cross-node communication under this constraint
  3. Co-design algorithms, framework, and hardware: Rather than assuming the best hardware, squeeze maximum efficiency from existing conditions

Core Concepts

MoE Architecture

In DeepSeek V3’s Transformer architecture, FFN (feed-forward network) layers are replaced with MoE layers. Each MoE layer has 256 expert modules, and each token is routed to 8 of them. Out of 671B total parameters, only ~37B activate per forward pass — making inference compute similar to a 37B dense model while retaining the model capacity of 671B.

DeepSeekMoE improvements:

  • Added “shared experts” on top of standard MoE, ensuring certain common knowledge isn’t routing-dependent
  • Fine-grained experts (256 instead of the traditional 8–16), allowing more precise routing
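The routing described above can be sketched in a few lines. This is a toy illustration, not DeepSeek's implementation: the experts are stand-in scalar functions, the scores are random, and `route` and `moe_layer` are hypothetical names. Only the 256-expert and 8-per-token figures come from the text.

```python
import math
import random

NUM_ROUTED = 256   # routed experts per MoE layer (from the text)
TOP_K = 8          # routed experts activated per token (from the text)


def route(scores: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring experts and normalise their weights to sum to 1."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}


def moe_layer(x, scores, routed_experts, shared_expert):
    """Output = shared expert (always on) + weighted sum of the chosen routed experts."""
    out = shared_expert(x)
    for idx, weight in route(scores).items():
        out += weight * routed_experts[idx](x)
    return out


random.seed(0)
# Toy experts: each just scales its scalar input (a stand-in for a small FFN).
routed_experts = [lambda x, s=i: x * (s + 1) / NUM_ROUTED for i in range(NUM_ROUTED)]
shared_expert = lambda x: x
# Toy router scores in (0, 1): sigmoid of random logits.
scores = [1 / (1 + math.exp(-random.gauss(0, 1))) for _ in range(NUM_ROUTED)]

gates = route(scores)
y = moe_layer(1.0, scores, routed_experts, shared_expert)
print(len(gates), "of", NUM_ROUTED, "routed experts ran for this token")
```

Only 8 of the 256 expert functions are ever called for the token, which is where the compute saving comes from; the shared expert runs regardless of the router's choice.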

Multi-head Latent Attention (MLA)

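In traditional MHA (Multi-head Attention), the KV Cache consumes large amounts of memory on long text because full keys and values are stored for every head and every token. MLA instead compresses K and V into a shared low-dimensional latent vector, caches only that latent, and expands it back into keys and values at attention time. This dramatically reduces the KV Cache memory footprint and the memory bandwidth needed during inference.

This matters especially for long-context inference, where the size of the KV Cache, rather than raw compute, often limits batch size and throughput.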
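A rough size comparison shows why caching a latent helps. The dimensions below are assumptions chosen to resemble the published model configuration (128 heads of width 128, a 512-wide latent, plus a small 64-wide positional key); none of them appear in this article, so treat the resulting ratio as illustrative only.

```python
# Per-token, per-layer cache size, counted in elements (not bytes).
N_HEADS = 128      # ASSUMPTION: number of attention heads
HEAD_DIM = 128     # ASSUMPTION: per-head key/value width
LATENT_DIM = 512   # ASSUMPTION: width of the compressed KV latent
ROPE_DIM = 64      # ASSUMPTION: width of the separately cached positional key


def mha_cache_elems(n_heads: int, head_dim: int) -> int:
    """Standard MHA caches one full key and one full value vector per head."""
    return 2 * n_heads * head_dim


def mla_cache_elems(latent_dim: int, rope_dim: int) -> int:
    """MLA caches one shared latent vector plus a small positional key."""
    return latent_dim + rope_dim


mha = mha_cache_elems(N_HEADS, HEAD_DIM)
mla = mla_cache_elems(LATENT_DIM, ROPE_DIM)
print(f"MHA: {mha} elements, MLA: {mla} elements, ratio: {mha / mla:.1f}x")
```

Under these assumed sizes the cached state per token shrinks by well over an order of magnitude, at the price of an extra up-projection when the keys and values are needed.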

Auxiliary-Loss-Free Load Balancing

A classic MoE problem is expert collapse — the router tends to send all tokens to a few experts, leaving most experts undertrained. The traditional fix is adding auxiliary loss functions to penalize imbalance, but this interferes with the primary training objective.

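DeepSeek V3’s solution adds a per-expert bias term to the routing scores used for top-k expert selection. The bias is adjusted dynamically during training (lowered for overloaded experts, raised for underloaded ones) and needs no extra loss function, so load stays balanced without interfering with the model’s main task learning.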
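The bias mechanism can be sketched as a small simulation. Everything here is a toy under stated assumptions: 8 experts instead of 256, synthetic affinities deliberately skewed toward high-index experts, and a fixed step size `GAMMA` chosen for illustration. The function names are hypothetical.

```python
import random

NUM_EXPERTS, TOP_K = 8, 2   # toy sizes, far smaller than the real model
GAMMA = 0.01                # ASSUMPTION: illustrative bias update step


def assign(affinity, bias, k=TOP_K):
    """Select the top-k experts by (affinity + bias); the bias only affects selection."""
    ranked = sorted(range(len(affinity)), key=lambda i: affinity[i] + bias[i], reverse=True)
    return ranked[:k]


def batch_load(bias, rng, tokens=256):
    """Count how many tokens each expert receives for one synthetic batch."""
    load = [0] * NUM_EXPERTS
    for _ in range(tokens):
        # Skewed affinities: higher-index experts are systematically preferred.
        affinity = [0.1 * i + rng.gauss(0, 0.1) for i in range(NUM_EXPERTS)]
        for expert in assign(affinity, bias):
            load[expert] += 1
    return load


def update_bias(bias, load):
    """Nudge overloaded experts down and the rest up by a fixed step; no loss term."""
    mean = sum(load) / len(load)
    return [b - GAMMA if l > mean else b + GAMMA for b, l in zip(bias, load)]


def imbalance(load):
    """Ratio of the busiest expert's load to the mean load (1.0 is perfectly even)."""
    return max(load) / (sum(load) / len(load))


rng = random.Random(0)
bias = [0.0] * NUM_EXPERTS
before = imbalance(batch_load(bias, rng))
for _ in range(200):
    bias = update_bias(bias, batch_load(bias, rng))
after = imbalance(batch_load(bias, rng))
print(f"max/mean load before: {before:.2f}, after: {after:.2f}")
```

The update rule never touches a gradient: it only reads the observed load and shifts the selection scores, which is why it does not compete with the training objective.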

Multi-Token Prediction

Traditional language models predict one next token at a time. DeepSeek V3 introduces multi-token prediction (predicting the next N tokens), letting the model learn longer-range dependencies during training and increasing training signal density.
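To make "training signal density" concrete, the sketch below lays out the supervision targets for a short token sequence at prediction depth 1 (classic next-token) and depth 2. It shows only how the targets are arranged; it is not DeepSeek's prediction module, and `mtp_targets` is a hypothetical helper.

```python
def mtp_targets(tokens: list[int], depth: int) -> list[tuple[int, list[int]]]:
    """Pair each position's token with the next `depth` tokens it must predict.

    Positions that do not have `depth` future tokens are dropped.
    """
    examples = []
    for i in range(len(tokens) - depth):
        examples.append((tokens[i], tokens[i + 1 : i + 1 + depth]))
    return examples


def count_targets(examples) -> int:
    """Total number of supervised target tokens across all positions."""
    return sum(len(targets) for _, targets in examples)


seq = [10, 11, 12, 13, 14]
single = mtp_targets(seq, depth=1)   # one target per position
multi = mtp_targets(seq, depth=2)    # two targets per position
print(single)
print(multi)
print(count_targets(single), count_targets(multi))
```

For the same five tokens, depth 2 supervises six target tokens instead of four, so each sequence contributes more learning signal.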

Comparison with Alternatives

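| Model | Type | Active Params | Training Cost (est.) | Open Source | API per 1M input tokens |
| --- | --- | --- | --- | --- | --- |
| DeepSeek V3 | MoE | 37B | ~$5.6M | Yes | $0.028 |
| GPT-4 | Dense (est.) | ~1T | >$100M | No | $10 |
| Claude 3.5 Sonnet | Undisclosed | Undisclosed | Undisclosed | No | $3 |
| Llama 3.1 405B | Dense | 405B | >$30M (est.) | Yes (partial) | Provider-dependent |
| Mistral Large | Dense | 123B | Undisclosed | No | $3 |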

DeepSeek V3’s pricing is approximately 107x cheaper than Claude Sonnet and 357x cheaper than GPT-4 — making large-scale deployment cost structures look completely different.
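The ratios follow directly from the listed prices. The snippet below recomputes them and, as a hypothetical workload, prices one billion input tokens per month; the workload size is an assumption for illustration, and output-token pricing is ignored.

```python
# USD per 1M input tokens, copied from the comparison table.
PRICES = {
    "DeepSeek V3": 0.028,
    "Claude 3.5 Sonnet": 3.0,
    "GPT-4": 10.0,
}


def price_ratio(other: str, base: str = "DeepSeek V3") -> float:
    """How many times more expensive `other` is than `base` per input token."""
    return PRICES[other] / PRICES[base]


def monthly_cost(model: str, tokens_per_month: float) -> float:
    """Input-token cost in USD for a given monthly token volume."""
    return PRICES[model] * tokens_per_month / 1e6


print(round(price_ratio("Claude 3.5 Sonnet")))  # 107
print(round(price_ratio("GPT-4")))              # 357

TOKENS_PER_MONTH = 1e9  # ASSUMPTION: hypothetical workload
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, TOKENS_PER_MONTH):,.2f}")
```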

When to Use It (and When Not To)

Good fit:

  • Commercial applications making high volumes of API calls (cost advantage most pronounced)
  • Code generation, mathematical reasoning, long-form text (V3’s strengths)
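  • Self-hosted deployment where compute, rather than memory, is the constraint (per-token inference compute is close to a 37B dense model, though all 671B parameters must still be held in memory)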
  • Research purposes (full technical report and model weights available)

Not a good fit:

  • Applications requiring the strictest data privacy (model from a Chinese company, API deployed on Chinese servers)
  • Real-time voice interaction (inference latency is not a strength)
  • Medical or legal applications requiring maximum accuracy (gap vs. the reasoning capability of OpenAI’s o1/o3 models)

The Big Picture

DeepSeek V3 changed the cost reference point for AI training. It’s not saying “billion-dollar systems have no value” — it’s saying “certain performance levels don’t require billions.”

The industry impact is already visible: OpenAI, Anthropic, and Google all accelerated their cheaper model offerings, and API pricing dropped continuously through 2025. DeepSeek’s contribution isn’t just a good model — it’s the complete open publication of MoE efficiency optimization research, giving the whole community a foundation to build on.

DeepSeek V4’s technical preview was released in April 2026, and is worth watching.


🇺🇸 English

In December 2024, a Chinese AI lab published a technical report that made researchers across the industry do a double-take. DeepSeek had trained a massive 671-billion-parameter model, and the total bill came to about $5.6 million. For context, GPT-4's training cost is estimated at over $100 million. Comparable performance. One-twentieth the cost. Fully open source.

This wasn't just "cheap AI." It was a fundamental challenge to everything the industry assumed about what frontier AI actually requires.

So let's talk about how they did it — and why it matters.

---

The core of DeepSeek V3 is something called a Mixture of Experts architecture, or MoE. Here's the intuition: instead of one giant neural network where every parameter fires for every input, you have hundreds of specialized sub-networks — "experts" — and each piece of text only activates a small subset of them.

DeepSeek V3 has 671 billion total parameters, but only about 37 billion activate for any given token. So the actual compute during inference is closer to running a 37-billion-parameter model, even though you're drawing on the knowledge capacity of something nearly twenty times larger. You get the expressiveness of a massive model without paying the full inference cost every single time.

They also pushed the expert design further than most earlier MoE models, using 256 fine-grained experts per layer instead of the typical 8 to 16, and adding "shared experts" that always activate regardless of routing. This ensures common knowledge isn't fragmented across specialists.

---

Now, one of the persistent headaches with Mixture of Experts models is what's called expert collapse. The routing mechanism — the part that decides which experts handle which tokens — tends to get lazy. It starts sending everything to the same few experts, leaving most of the network undertrained and wasted. The traditional fix is adding penalty terms to the training loss to force better distribution, but that creates its own problem: you're now fighting against your own training objective.

DeepSeek's solution was elegant. Instead of adding extra loss functions, they introduced small bias terms that get dynamically adjusted during training to keep load balanced — without touching the main learning signal at all. The model stays focused on its actual job, and the routing balances itself out in the background.

---

Another innovation worth understanding is what they call Multi-head Latent Attention. In standard attention mechanisms, the memory required to handle long text grows with every token: you're storing what's called a KV cache, which holds keys and values for the entire sequence. When memory and bandwidth are at a premium (and remember, DeepSeek was working with H800s, the export-restricted variant of Nvidia's H100 with reduced interconnect bandwidth), this becomes a real bottleneck.

MLA's trick is projecting the attention keys and values into a compressed, low-dimensional space before expanding them back out. You dramatically reduce the memory footprint without losing the expressiveness. For long-context inference, this is a meaningful efficiency gain.

---

There's one more piece: multi-token prediction. Standard language models predict one token at a time — each forward pass produces the next word. DeepSeek V3 trains to predict multiple future tokens simultaneously. This does two things: it forces the model to learn longer-range dependencies, and it packs more learning signal into each training step. You're getting more out of every GPU-hour.

---

Let's talk numbers for a second, because the cost story is genuinely striking.

DeepSeek V3 at API pricing runs about $0.028 per million input tokens. Claude 3.5 Sonnet is around $3. GPT-4 is around $10. That's not a small difference — DeepSeek is roughly 100 times cheaper than Claude and over 350 times cheaper than GPT-4 at scale. If you're building a product that makes millions of API calls, that pricing gap completely reshapes your unit economics.

On benchmarks — coding, math, long-form reasoning — V3 matches or closely approaches the top closed-source models. It's not universally better, but it competes in the same tier.

---

Where does it make sense to use it? High-volume API applications get the most obvious benefit from the cost structure. Code generation and mathematical reasoning are particular strengths. Local deployment is viable because the active parameter count is manageable. And for researchers, the full technical report and model weights are public — you can actually read exactly how they built this.

Where does it fall short? If your application needs the strongest possible reasoning, the kind OpenAI's o1 and o3 models target, there's still a gap. Real-time voice interaction isn't a strength. And if data sovereignty is a hard requirement, the fact that this is a Chinese company with API infrastructure on Chinese servers is a legitimate consideration depending on your use case.

---

Here's the bigger picture takeaway.

DeepSeek V3 didn't prove that billion-dollar AI systems have no value. What it proved is that certain performance levels don't require billions. And once that becomes visible, it changes competitive dynamics across the industry. OpenAI, Anthropic, and Google all moved faster on cheaper model tiers through 2025. API pricing dropped consistently. The whole market recalibrated.

The contribution isn't just the model itself — it's the detailed, public documentation of every architectural decision, every efficiency technique, every tradeoff. The community now has a concrete blueprint for MoE optimization that didn't exist before.

Three things to carry with you: First, MoE architecture lets you scale model capacity without proportionally scaling inference cost — that's the fundamental insight. Second, load balancing and attention efficiency were the two places where DeepSeek found the most headroom that others had left on the table. And third, the open publication of this work accelerated the entire field — which means even if you never use DeepSeek directly, you're already benefiting from what they figured out.

🇹🇼 Chinese

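In December 2024, DeepSeek dropped a bombshell: they had trained a top-tier model for about $5.6 million. GPT-4, from the same era, is estimated to have cost over $100 million to train. Comparable performance, one-twentieth the cost, fully open source. After reading the report, the first reaction of many people in AI research was to redo the math.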

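The model is called DeepSeek V3. It has 671 billion parameters, but it is not an ordinary dense model. It uses an MoE architecture, short for Mixture of Experts.

Let me explain what MoE is. Picture 256 "expert modules" inside the model; for each token it processes, only the 8 best-suited experts are activated. The total is 671 billion parameters, but each step of actual computation uses only about 37 billion. That puts its inference compute close to a 37B model while its knowledge capacity is that of a 671B model. A big brain that doesn't have to switch everything on for every thought: that is the core advantage of MoE.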

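DeepSeek made several key architectural innovations. I'll go through them one by one.

The first is MLA, Multi-head Latent Attention. With traditional attention, the KV Cache eats a lot of memory on long text; think of it as "conversation memory" that takes more space the longer it gets. MLA compresses the Key and Value into a low-dimensional latent space and expands them when needed, so memory use drops sharply. This matters a lot for long-text workloads, especially on bandwidth-constrained H800s.

Right, the H800 is the GPU available in China, the export-controlled version of the H100. DeepSeek wasn't doing research on ideal hardware; it was squeezing out every bit of efficiency under constraints. That starting point alone sets it apart.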

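The second innovation is a new approach to load balancing. MoE has a long-standing problem: the router gets lazy and sends most tokens to the same few popular experts, so the unpopular ones barely get trained. The traditional fix is a penalty term, but that penalty interferes with the main training objective, like running with sandbags strapped to your ankles. DeepSeek V3's approach adds dynamic bias terms to the routing scores before experts are selected, automatically adjusting how many tokens each expert receives. No extra penalty function is needed, the balancing works just as well, and the main task is left undisturbed.

The third is multi-token prediction. A typical language model predicts only the next token. During training, V3 predicts several future tokens at once, which lets the model learn longer-range dependencies and makes the training signal denser: same compute, more learned.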

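Put together, it took 2.78 million H800 GPU-hours and about $5.57 million to reach a level close to GPT-4 on a range of standard benchmarks.

Now the pricing comparison, because that is what many people care about most. DeepSeek V3's API costs about $0.028 per million input tokens. GPT-4 is $10 and Claude Sonnet is $3. Do the math: DeepSeek is about 100 times cheaper than Claude Sonnet and roughly 350 times cheaper than GPT-4. This is not a modest discount; it redefines the cost structure.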

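So when is it a good fit? Commercial applications with heavy API call volume, code generation, mathematical reasoning, and long-text processing are all V3 strengths. It also suits local deployment with limited compute, because the actual computation in MoE inference is only on the order of a 37B model.

When is it not a good fit? If your application needs the strictest data privacy, keep in mind that this model comes from a Chinese company and the API runs on Chinese servers. If you need the kind of deep reasoning that o1 and o3 offer, V3 is still a step behind that tier.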

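Overall, DeepSeek V3 changed more than a price tag; it changed the whole industry's expectations of how much a top-tier model has to cost. The impact is already visible: AI API pricing fell across the board in 2025, and every vendor is racing to ship cheaper options.

Three core points are worth remembering. First, the MoE architecture lets a large parameter count coexist with low compute cost. Second, co-designing algorithms, framework, and hardware under constraints actually forced out higher system efficiency. Third, the fully open technical report lets the whole community build on this foundation, a contribution that will outlast the model itself.

DeepSeek V4's technical preview was released in April 2026 and is worth continuing to watch.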
