Table of Contents
DuckDB has been improving faster than most people expected. Engineers who benchmarked it a year ago and set it aside are discovering it has moved considerably since then. This piece covers the actual benchmark numbers, what it compares to, and why the performance ceiling question remains genuinely open.
TL;DR
DuckDB improved group-by performance by more than 12x and join performance by 4x over three years. It can now run TPC-H at SF3,000 (3 TB) in 31 minutes on a 12-core laptop. Its design constraint is single-node embedded analytics—but within that constraint, what is achievable keeps expanding with each release.
What DuckDB Is
DuckDB is an embedded OLAP database with a philosophy similar to SQLite, but aimed at the opposite workload. SQLite is optimized for OLTP (individual record reads and writes); DuckDB is optimized for OLAP (aggregations and scans across large datasets).
“Embedded” means it runs inside your process with no separate server to start:
import duckdb
con = duckdb.connect()
result = con.execute("""
SELECT category, SUM(amount) as total
FROM 'sales.parquet'
GROUP BY category
ORDER BY total DESC
""").fetchdf()
Zero startup time, no network latency, no connection pool to manage. For data scientists replacing pandas, the migration cost is essentially zero.
Why It Matters
Before DuckDB, the local options for analyzing gigabyte-scale data were pandas (limited by RAM, crashes above it) or a local PostgreSQL instance (OLTP architecture, slow for analytical queries). DuckDB fills the gap of “data too large for pandas but not worth spinning up a Spark cluster.”
Key features engineers appreciate:
- Reads Parquet, CSV, and JSON directly without ETL
- Zero-copy interop with pandas and Polars via Apache Arrow
- Full SQL: window functions, LATERAL joins, UNNEST, list comprehensions
- Out-of-core query execution since version 0.10.0 (data can exceed RAM)
How It Achieves This Performance
Three architectural decisions drive DuckDB’s numbers:
Vectorized execution: processes thousands of values at once (a “vector”) rather than row by row, enabling SIMD CPU instructions (SSE, AVX) and minimizing per-row overhead.
Columnar storage: reads only the columns needed for a query, dramatically reducing I/O.
Adaptive query execution: the query plan adjusts dynamically during execution based on observed data distributions, rather than committing to a static plan upfront.
graph LR
subgraph "Query Execution"
SQL["SQL Query"]
Parser["Parser"]
Optimizer["Query Optimizer\n(cost estimation + plan)"]
Exec["Vectorized Executor\n(vector batches)"]
Out["Result Set"]
end
SQL --> Parser --> Optimizer --> Exec --> Out
Exec -->|"Data > RAM\n(v0.10.0+)"| Spill["Disk Spill"]
Spill --> Exec
Performance Numbers (2022–2025)
DuckDB publishes its own benchmarks over time page. Key findings:
- Group by / aggregation: more than 12x improvement over three years—this is the most critical OLAP operation
- Join: 4x improvement over three years
- Memory management (v0.10.0): out-of-core execution unlocked—queries can now handle datasets significantly larger than available RAM
- 2023–2024 window: even after what looked like a plateau, the period shows approximately 20% continued improvement on a zoom-in view
At large TPC-H scale (12-core Framework laptop):
| Scale Factor | Data Size | Runtime |
|---|---|---|
| SF3,000 | ~3 TB | ~31 minutes |
| SF10,000 | ~10 TB | ~4.2 hours |
These are single-machine laptop numbers, not server clusters.
Comparison with Alternatives
| DuckDB | ClickHouse | Polars | pandas | |
|---|---|---|---|---|
| Deployment | Embedded | Server | Embedded | Embedded |
| Target scale | GB–TB (single node) | TB–PB (cluster) | GB–TB (single node) | GB (RAM-limited) |
| SQL support | Full | Full | Partial (LazyFrame) | None (DataFrame API) |
| Out-of-core | Yes (v0.10.0+) | Yes | Yes | No |
| Distributed | No | Yes | No | No |
| OLAP speed | 3–10x faster than pandas | ~4x faster than DuckDB at large scale | Close to DuckDB | Baseline |
Exasol’s benchmark (2025): on 10/30/100 GB TPC-H-style tests, Exasol ran more than 4x faster than DuckDB on a high-spec single machine. However, Exasol is a dedicated server requiring separate deployment—a different use case than an in-process library.
ClickHouse is clearly superior for distributed, high-concurrency scenarios at TB-to-PB scale. DuckDB’s edge is the embedded, zero-setup experience and the aggressive performance it delivers without any operational overhead.
When to Use and When Not To
Good fit:
- Data science and ML preprocessing (replacing pandas for >1 GB files)
- Local analysis of Parquet / CSV data
- Embedded analytics inside Python, Go, or Node.js applications
- Data quality checks in CI pipelines
- ETL transformations across multiple file formats
Poor fit:
- High-concurrency OLTP (many short reads and writes)
- Multi-node distributed real-time analytics
- Multi-user environments requiring fine-grained access control
Summary
DuckDB’s performance ceiling is hard to pin down because the target keeps moving. The design scope is clear: single-node, single-user, embedded OLAP. What is achievable within that scope in the 2025 version is substantially more than what was imaginable in the 2022 version. For most analytics engineers, the question has shifted from “is it fast enough?” to “should I design my workflow around it?”—which is a more interesting problem to have.
References
🇺🇸 English
Here's the podcast script:
---
If you benchmarked DuckDB a year ago and moved on, you might want to take another look — because it has moved considerably since then. This isn't a story about hype. It's about actual benchmark numbers, what they mean, and why the question of DuckDB's performance ceiling remains genuinely open.
Let's start with what DuckDB actually is. Think of it as SQLite's analytical counterpart. SQLite is built for the kind of database work where you're reading and writing individual records — your typical app database. DuckDB flips that entirely. It's built for analytical queries: aggregations, scans across millions or billions of rows, the kind of work data scientists do. And like SQLite, it's embedded — it runs inside your program, no separate server to spin up, no network to traverse, no connection pool to manage. You just import it and start querying.
For data scientists, this matters a lot. The old local workflow was either pandas — which runs fine until your dataset gets big enough to crash your session — or spinning up a local PostgreSQL instance, which is an OLTP database and not really designed for analytical workloads. DuckDB fills that gap: data too large for pandas, not worth launching a Spark cluster. And it reads Parquet, CSV, and JSON directly, no ETL required.
Now, how does it actually get the performance it does? Three architectural decisions stand out. First, vectorized execution — instead of processing one row at a time, it processes thousands of values together in a batch. This lets modern CPUs use their SIMD instructions, the kind of parallel math operations that modern chips are extremely good at. Second, columnar storage — when you run a query that only touches three columns out of fifty, DuckDB only reads those three columns. That alone can cut I/O by an order of magnitude. Third, adaptive query execution — rather than committing to a static plan before the query runs, DuckDB adjusts its strategy based on what it actually sees in the data. And since version 0.10.0, it can also spill data to disk when queries exceed available RAM, which was a significant unlock.
Now for the numbers. DuckDB tracks its own performance over time, and the trajectory is striking. Group-by and aggregation operations — the most critical workload for OLAP — improved by more than twelve times over three years. Join performance improved by four times. And this wasn't a straight-line improvement that plateaued; even in a zoomed-in view of 2023 to 2024, there's still roughly twenty percent continued improvement after what looked like a ceiling.
The most striking demonstration: running TPC-H at SF3000 — that's roughly three terabytes of data — on a twelve-core laptop takes about thirty-one minutes. Push it to SF10000, which is ten terabytes, and you're looking at about four hours. On a laptop. Not a server cluster. A laptop.
How does it compare to alternatives? Polars is close in speed for single-machine workloads, also embedded, no distributed support. pandas is the baseline everyone's replacing — it's RAM-limited and significantly slower. ClickHouse is genuinely faster for distributed, high-concurrency scenarios at terabyte-to-petabyte scale, but it's a server you deploy and operate, not an in-process library. Exasol ran a benchmark in 2025 that showed it running more than four times faster than DuckDB on a high-spec machine — but again, that's a dedicated server deployment, a completely different operational model.
DuckDB's edge isn't claiming to beat everything everywhere. It's that for what it does — embedded, in-process, single-node analytics — the performance gap with "real" analytical databases has been closing fast enough that the question isn't whether it's fast enough anymore. The question is whether you should design your data workflow around it.
Where it's a great fit: data science preprocessing, local analysis of large files, embedding analytics directly into a Python or Go or Node.js application, running data quality checks in CI pipelines. Where it's the wrong tool: anything involving many concurrent users, distributed real-time analytics, or environments where you need fine-grained access control across multiple users.
So here are the three things to take away from this. One: DuckDB's performance trajectory has been more aggressive than most engineers expected — if you haven't revisited it recently, your mental model of its limits is probably out of date. Two: the design constraints are clear and haven't changed — single node, single user, embedded OLAP — but what's achievable within those constraints in 2025 is dramatically more than what was imaginable in 2022. And three: the ceiling question is still genuinely open. Each release has moved the bar. At this point, the more interesting engineering question isn't whether DuckDB is fast enough — it's whether your data architecture is shaped to take advantage of it.
---
🇹🇼 中文
DuckDB 的效能提升速度,最近真的讓很多人跌破眼鏡。三年內,group by 快了 12 倍,join 快了 4 倍——而且這個數字還在動。今天我們就來聊聊它到底做了什麼,以及為什麼在 2025 年,它的天花板仍然是個問號。
---
先說 DuckDB 是什麼。它是一個嵌入式的 OLAP 資料庫,設計哲學跟 SQLite 很像,但方向完全相反。SQLite 是為了單筆記錄讀寫設計的,DuckDB 是為了大量資料的聚合分析。「嵌入式」的意思是它直接跑在你的 process 裡面,不需要開獨立的 server,啟動時間幾乎是零,也沒有網路延遲。
對資料科學家來說,這件事的意義在於:你基本上只需要把原本用 pandas 的地方換成 DuckDB,就能直接處理幾 GB 甚至幾百 GB 的資料,而且它可以直接讀 Parquet、CSV、JSON,不用 ETL,也不用先把資料匯入什麼地方。
---
它的速度從哪裡來?有幾個關鍵設計。
第一個是向量化執行。傳統資料庫一次處理一行,DuckDB 一次處理幾千個值——這讓 CPU 可以充分利用現代處理器的 SIMD 指令,像是 AVX、SSE 這類東西,批次運算效率差很多。
第二個是列式儲存。分析查詢通常只需要幾個欄位,DuckDB 只讀你要的那幾欄,I/O 少非常多。
第三個,0.10.0 版之後,它支援 out-of-core 查詢——也就是資料大過你的 RAM 也能跑,它會在必要時把中間結果寫到磁碟,再繼續執行。這個改動在 2024 年打破了「DuckDB 受限於記憶體大小」的認知。
---
講到具體數字。在一台普通的 12 核心筆電上跑 TPC-H 測試,SF3,000 也就是 3TB 的資料量,DuckDB 可以在 31 分鐘跑完;SF10,000,就是 10TB,大概 4 個多小時。單機筆電跑 10TB 的分析查詢——這個數字放在五年前大概會被笑,但它就是做到了。
而且官方有一個專門追蹤歷次 release 效能變化的頁面,資料很透明。從上面可以看到即使 2023 到 2024 年外觀上看起來已經趨於平穩,細看仍然有穩定的 20% 左右的持續改進。
---
那它跟其他方案怎麼比?
跟 pandas 比:DuckDB 在典型 OLAP 場景快 3 到 10 倍,而且不受 RAM 限制。
跟 Polars 比:速度接近,但 DuckDB 的 SQL 支援更完整,Polars 是 DataFrame API,如果你的團隊習慣 SQL,DuckDB 更自然。
跟 ClickHouse 比:ClickHouse 在高並發、分散式、TB 到 PB 規模的線上分析場景明顯勝出。但 ClickHouse 要獨立部署,DuckDB 是嵌進你的程式裡的——這是兩個不同的使用情境,不是同一場競技。
Exasol 2025 年的測試顯示在 10 到 100GB 規模下它比 DuckDB 快約 4 倍,但 Exasol 同樣是需要單獨部署的 server,對比嵌入式的 DuckDB,本來就不是蘋果對蘋果的比較。
---
什麼時候用 DuckDB 最合適?資料科學前處理、ML 特徵工程、本地分析 Parquet 或 CSV、把它嵌進 Python 或 Go 應用裡做內嵌分析、CI pipeline 裡的資料品質檢查——這些場景都很適合。
什麼時候不適合?高並發的 OLTP、需要多節點分散式的大規模即時分析、需要細緻存取控制的多用戶共享環境——這些不是它的設計目標。
---
收尾來整理三個核心點。
第一,DuckDB 的定位很清晰:單機、單用戶、嵌入式的 OLAP 分析引擎。搞清楚這個邊界,就不會拿它去做它不擅長的事。
第二,在這個邊界內,它的效能三年內提升了 4 到 12 倍,而且還沒有停下來的跡象。2022 年的設計目標到了 2025 年已經被遠遠超過了。
第三,對大多數分析工程師來說,現在真正要想的問題不是「DuckDB 夠不夠快」,而是「我的工作流程有沒有圍繞它設計的理由」——答案很可能是有的。
Tags
Related Articles
How OpenAI Scaled a Single PostgreSQL Instance to 800 Million ChatGPT Users: 50 Read Replicas, PgBouncer, Cascading Replication
OpenAI's ChatGPT database architecture is a single primary PostgreSQL instance with ~50 read replicas, PgBouncer connection pooling, and cascading replication on Azure. The core insight: read-heavy workloads don't need sharding — optimizing the read path is what matters.
System Design Mock: Architecture Decisions for a Book E-Commerce Platform
For a book selling platform, the key decisions are search architecture (Elasticsearch vs full-text search), inventory consistency (strong vs eventual), and order state machine design.
DDIA Chapter 1: Reliability, Scalability, Maintainability — Three Terms Engineers Use Wrong
DDIA Chapter 1's core argument: the challenge of data-intensive systems isn't big compute — it's data complexity (volume, variety, velocity). Evaluating this complexity requires precise definitions of reliability, scalability, and maintainability that are more specific than how most engineers use these terms.