What Is a Data Lakehouse? From Data Warehouses to Open Table Formats

Data infrastructure has taken a winding path over the last decade. Data warehouses dominated, then data lakes emerged as a cheap way to store massive amounts of raw data — only to create years of headaches around governance and query performance. The Data Lakehouse is the architecture pattern trying to end that tug-of-war: the same storage layer delivers both data warehouse reliability and data lake flexibility.

TL;DR

Data Warehouse: structured, high-performance, high-cost; schema-on-write makes schema changes painful
Data Lake: low cost, flexible; but lacks ACID, hard to govern, slow queries
Data Lakehouse: adds an open table format on top of object storage (S3/GCS), getting the benefits of both
Main implementations: Apache Iceberg (cross-engine interop) and Delta Lake (Databricks-ecosystem-first)
2025 trend: Delta Lake UniForm lets a Delta table also publish Iceberg metadata, so Iceberg-capable engines can read it directly (“write once, read anywhere”)

What Is It

A Data Lakehouse is an architectural pattern, not a specific software product. The core idea:

Layer a transactional metadata format on top of open-format object storage.

The traditional approach runs data lakes and data warehouses in parallel, with ETL pipelines moving data between them, which introduces latency and consistency problems. The Lakehouse collapses this into a single layer: data lives in Parquet or ORC files on S3/GCS/ADLS, and open table formats like Apache Iceberg or Delta Lake provide ACID semantics, time travel, and schema evolution on top.

```mermaid
graph LR
    A[Data Sources] --> B[Object Storage S3 / GCS]
    B --> C[Open Table Format Iceberg / Delta Lake]
    C --> D[Metadata and Transaction Log]
    C --> E[Spark]
    C --> F[Trino / Presto]
    C --> G[Snowflake]
    C --> H[BigQuery]
    D --> I[ACID Transactions]
    D --> J[Time Travel]
    D --> K[Schema Evolution]
```
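
To make the time travel idea concrete, here is a minimal sketch of how a table format could serve older versions. It is not any real library's API: `ToyTable`, `Snapshot`, and the file names are hypothetical, and it assumes commits arrive with increasing timestamps. The point is only that data files are immutable and each commit records a new list of visible files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: tuple  # paths of the immutable data files visible in this version


class ToyTable:
    """Hypothetical table: each commit appends a snapshot; old ones never change."""

    def __init__(self) -> None:
        self.snapshots: list = []

    def commit(self, timestamp_ms: int, added=(), removed=()) -> Snapshot:
        # Assumes timestamps increase from one commit to the next.
        current = self.snapshots[-1].data_files if self.snapshots else ()
        dropped = set(removed)
        files = tuple(f for f in current if f not in dropped) + tuple(added)
        snapshot = Snapshot(len(self.snapshots) + 1, timestamp_ms, files)
        self.snapshots.append(snapshot)
        return snapshot

    def as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        """Time travel: latest snapshot committed at or before the timestamp."""
        visible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return visible[-1] if visible else None


table = ToyTable()
table.commit(1_000, added=["a.parquet"])
table.commit(2_000, added=["b.parquet"])
table.commit(3_000, added=["c.parquet"], removed=["a.parquet"])

print(table.as_of(2_500).data_files)    # ('a.parquet', 'b.parquet')
print(table.snapshots[-1].data_files)   # ('b.parquet', 'c.parquet')
```

Because `a.parquet` is only dropped from the newest snapshot and never deleted in place, a reader pinned to the older snapshot still sees it.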

Why It Matters

Data warehouses suffer from cost and flexibility problems: traditional warehouses couple compute and storage, making scaling expensive, and even cloud warehouses that separate the two usually keep data in a proprietary format; strict schema requirements make ML and AI workloads difficult to accommodate.

Data lakes suffer from reliability problems: no ACID transactions means concurrent writes can corrupt data; no schema validation means data quality is hard to guarantee; small-file proliferation periodically requires manual maintenance to restore query performance.

Lakehouse solves:

Cost: data stays in cheap object storage; compute is on-demand
ACID: table format transaction logs ensure atomicity
Multi-engine: one copy of data can be read simultaneously by Spark, Trino, Snowflake, DuckDB
ML/AI-friendly: unstructured and semi-structured data can coexist with structured data
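
The atomicity point can be illustrated with a small sketch of an optimistic commit. This is a simplified illustration, not how any particular table format implements it: the function names and the zero-padded file naming are assumptions for the example. The idea is that a commit is published by creating a file that must not already exist, so when two writers race for the same version only one can win and the other retries.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to publish `version`; return False if another writer got there first."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" raises FileExistsError if the file exists, so one writer wins.
        # Caveat: a production system would also make the write itself atomic.
        with open(path, "x", encoding="utf-8") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir: str, actions: list, max_attempts: int = 5) -> int:
    """Pick the next free version and retry if a concurrent writer takes it."""
    for _ in range(max_attempts):
        committed = [n for n in os.listdir(log_dir) if n.endswith(".json")]
        next_version = len(committed)
        if try_commit(log_dir, next_version, actions):
            return next_version
    raise RuntimeError("too many concurrent commits")


with tempfile.TemporaryDirectory() as log_dir:
    print(try_commit(log_dir, 0, [{"add": "a.parquet"}]))    # True
    print(try_commit(log_dir, 0, [{"add": "b.parquet"}]))    # False: version 0 is taken
    print(commit_with_retry(log_dir, [{"add": "b.parquet"}]))  # 1
```

A real implementation would also check whether the losing writer's changes conflict with what the winner committed before retrying; this sketch skips that step.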

How It Works

Taking Apache Iceberg as an example, its metadata layer has three tiers:

Metadata files: record table schema, partition spec, and snapshot history
Manifest lists: each snapshot maps to a manifest list recording which data files belong to it
Manifest files: record the path, row count, and statistics (min/max values) of each Parquet data file

Before scanning, a query engine walks metadata file → manifest list → manifest files and uses the statistics for partition pruning and file pruning, which sharply reduces the data it has to scan. Because each commit writes a new metadata file and atomically swaps the table's pointer to the current metadata (the swap goes through the catalog), engines always see a consistent table version without a centralized lock manager.
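
The walk described above can be sketched in a few lines. This is a toy model under stated assumptions, not Iceberg's actual classes or file layout: the dataclass names are hypothetical, and statistics are reduced to a single integer column's min and max. It shows how a planner can decide which files to open using metadata alone.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    row_count: int
    min_value: int  # min of the filtered column within this file
    max_value: int  # max of the filtered column within this file


@dataclass
class Manifest:
    files: list  # list of DataFile entries


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: list  # list of Manifest objects belonging to this snapshot


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: dict  # snapshot_id -> Snapshot


def plan_scan(meta: TableMetadata, lo: int, hi: int) -> list:
    """Return paths of files whose [min, max] range overlaps the query range [lo, hi]."""
    snapshot = meta.snapshots[meta.current_snapshot_id]
    selected = []
    for manifest in snapshot.manifest_list:
        for f in manifest.files:
            if f.max_value >= lo and f.min_value <= hi:
                selected.append(f.path)
    return selected


manifest = Manifest(files=[
    DataFile("a.parquet", 1000, 1, 100),
    DataFile("b.parquet", 1000, 101, 200),
    DataFile("c.parquet", 1000, 201, 300),
])
meta = TableMetadata(1, {1: Snapshot(1, [manifest])})

print(plan_scan(meta, 150, 160))  # ['b.parquet']
```

Only one of the three files is selected, and no data file was opened to decide that. Real engines apply the same overlap test at more than one level and across many columns.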

Delta Lake’s architecture is similar but more Spark-centric: a _delta_log/ directory at the table root stores one JSON transaction record per commit, and a Parquet checkpoint is typically written every 10 commits (the interval is configurable) so readers do not have to replay the whole log.
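
A rough sketch of the log-replay idea follows. It is a simplification rather than the Delta Lake implementation: it models only `add` and `remove` actions, and instead of writing a Parquet checkpoint it just marks the versions where one would be taken. The helper names are hypothetical, and the interval of 10 is taken from the text.

```python
import json
import os
import tempfile

CHECKPOINT_INTERVAL = 10  # interval quoted in the text; treated here as an assumption


def write_commit(log_dir: str, version: int, actions: list) -> None:
    """Write one commit as newline-delimited JSON, named by zero-padded version."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w", encoding="utf-8") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def replay(log_dir: str) -> set:
    """Replay commits in version order to find the data files that are live now."""
    live = set()
    for name in sorted(os.listdir(log_dir)):  # zero padding makes this version order
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name), encoding="utf-8") as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live


def is_checkpoint_version(version: int) -> bool:
    """Versions at which this sketch would snapshot state to avoid a full replay."""
    return version > 0 and version % CHECKPOINT_INTERVAL == 0


with tempfile.TemporaryDirectory() as table_root:
    log_dir = os.path.join(table_root, "_delta_log")
    os.makedirs(log_dir)
    write_commit(log_dir, 0, [{"add": {"path": "a.parquet"}}])
    write_commit(log_dir, 1, [{"add": {"path": "b.parquet"}}])
    write_commit(log_dir, 2, [{"remove": {"path": "a.parquet"}},
                              {"add": {"path": "c.parquet"}}])
    print(sorted(replay(log_dir)))  # ['b.parquet', 'c.parquet']
```

Replay cost grows with the number of commits, which is the motivation for checkpoints: a reader can start from the latest checkpoint and apply only the commits after it.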

Compared to Data Warehouses and Data Lakes

| Dimension | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Storage format | Proprietary | Open (Parquet/ORC) | Open format |
| Storage cost | High | Low | Low |
| ACID transactions | Yes | No | Yes (via table format) |
| Schema | Strict (write-time) | Flexible (read-time) | Evolvable |
| Multi-engine access | Difficult | Easy | Easy |
| Streaming | Limited | Difficult | Supported (Iceberg v2+) |
| ML/AI workloads | Difficult | Convenient | Convenient |

Apache Iceberg vs Delta Lake

| Dimension | Apache Iceberg | Delta Lake |
| --- | --- | --- |
| Origin | Netflix → Apache Software Foundation | Databricks → Linux Foundation |
| Design focus | Cross-engine interop, large-scale partitioning | Spark performance, DML simplicity |
| Catalog | Multiple (Hive, Nessie, REST) | Primarily Unity Catalog |
| Engine support | Snowflake, Dremio, BigQuery, Flink | Primarily Databricks, Spark |
| Format interop | Delta tables can be snapshotted into Iceberg with the iceberg-delta-lake migration module (no native Delta reads) | Delta UniForm publishes Iceberg metadata |

The 2025 convergence trend: Delta Lake’s UniForm feature lets a Delta table simultaneously expose Iceberg-compatible metadata, so any Iceberg-capable engine can read it as if it were native. Write in Delta, read from Snowflake — “write once, read anywhere.”
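
As a rough illustration of what turning on UniForm looks like, the helper below renders a `CREATE TABLE` statement with the two table properties that, to my knowledge, the Delta Lake UniForm documentation describes. Treat the property names as something to verify against your Delta version. The helper `uniform_ddl` and the table and column names are hypothetical, and the statement is only built as a string here, not executed.

```python
def uniform_ddl(table: str, columns: dict) -> str:
    """Render a Delta CREATE TABLE statement with Iceberg metadata generation enabled."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    # Property names as recalled from the Delta Lake UniForm docs; verify before use.
    props = {
        "delta.enableIcebergCompatV2": "true",
        "delta.universalFormat.enabledFormats": "iceberg",
    }
    prop_sql = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"CREATE TABLE {table} ({cols}) USING DELTA TBLPROPERTIES ({prop_sql})"


print(uniform_ddl("sales.orders", {"order_id": "BIGINT", "amount": "DOUBLE"}))
```

In a Spark session with Delta Lake configured, a statement like this would normally be passed to `spark.sql(...)`. Iceberg readers then discover the table through a catalog that points at the generated Iceberg metadata.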

Summary

The Data Lakehouse is no longer just a concept; it is the default starting point for many data engineering teams designing new systems in 2025. Choosing between table formats:

Primarily Databricks ecosystem → Delta Lake
Need multi-engine interop (Snowflake + Spark + Trino) → Apache Iceberg
Want both → Delta UniForm or Iceberg with a multi-engine catalog

Regardless of which you pick, the underlying principle is the same: put reliability in the metadata layer, leave flexibility and low cost in object storage.

🇺🇸 English

Data infrastructure has had a messy decade. First, data warehouses were king — structured, reliable, fast. Then data lakes showed up promising cheap, flexible storage for massive amounts of raw data. And for a while, most companies ran both in parallel, with ETL pipelines constantly shuffling data between them. The result? Latency, consistency nightmares, and engineering teams maintaining two completely different systems.

The Data Lakehouse is the architectural pattern designed to end that tug-of-war.

Here's the core idea: instead of keeping a data warehouse and a data lake side by side, you collapse them into one layer. Your data lives in open file formats — like Parquet — sitting on cheap object storage, think S3 or Google Cloud Storage. And then you put an open table format on top of that storage layer to give you all the reliability features you expect from a warehouse: ACID transactions, time travel, schema evolution. One storage layer doing the job of two systems.

Now, to understand why this matters, let's quickly recap what made the old approaches painful.

Data warehouses are reliable and fast, but they're expensive. Traditional warehouses tightly couple compute and storage, so scaling one means scaling both. And their strict schema requirements, where you define the structure before writing data, make them hostile to the messy, evolving data that machine learning and AI workloads actually need.

Data lakes flipped this around. You can dump anything in there cheaply. But without ACID transaction guarantees, concurrent writes can silently corrupt your data. There's no schema validation, so data quality becomes anyone's guess. And small-file proliferation — where thousands of tiny files accumulate over time — kills query performance until you do manual maintenance. Flexible, yes. Reliable, no.

The Lakehouse addresses all of this. Your data stays in cheap object storage, so costs stay low. But an open table format sits on top and manages a transaction log that enforces atomicity — either a write fully succeeds or it doesn't happen at all. And because the data files themselves are in open formats, multiple query engines can read the same data simultaneously. Spark, Trino, Snowflake, DuckDB — they all read the same Parquet files without you making copies.

Let's get into how this actually works under the hood, using Apache Iceberg as the example.

Iceberg manages a three-tier metadata layer. At the top, metadata files describe the table schema, how it's partitioned, and a history of snapshots. Below that, each snapshot points to a manifest list — essentially a directory of which data files belong to that snapshot. And each manifest file records the path, row count, and statistics like minimum and maximum column values for every individual Parquet file.

When a query engine runs a query, it walks down this hierarchy: metadata, then manifest lists, then manifest files. At each step, it uses those statistics to skip entire files that can't possibly contain relevant data. This is called partition pruning and file pruning, and it's what makes Iceberg fast even on massive datasets. And because table versioning is managed through metadata pointers rather than a centralized lock manager, multiple engines can read different snapshots simultaneously without blocking each other.

Delta Lake, the other major player, takes a similar approach but with a more Spark-centric design. Instead of a tiered metadata hierarchy, it stores a transaction log as a series of JSON files in a directory at the table root. Typically every ten commits, it writes a Parquet checkpoint file to speed up log loading. Simpler to understand, and deeply integrated with the Databricks ecosystem.

So which one do you pick?

Apache Iceberg came out of Netflix and is now an Apache Software Foundation project. Its design philosophy is cross-engine interoperability — it supports multiple catalog systems and is natively supported by Snowflake, Dremio, BigQuery, Flink, and others. If your architecture spans multiple query engines, Iceberg is the safer bet.

Delta Lake came from Databricks and moved to the Linux Foundation. It's optimized for Spark performance and makes common data manipulation operations simpler. If you're deeply in the Databricks ecosystem, Delta Lake is the path of least resistance.

But here's the 2025 development that changes the calculus somewhat: Delta Lake's UniForm feature. UniForm lets a Delta table simultaneously expose Iceberg-compatible metadata. So any engine that speaks Iceberg — Snowflake, Trino, whatever — can read a Delta table as if it were native Iceberg. Write in Delta, read from anywhere. The two formats are converging rather than staying isolated.

Three things to take away from all of this.

First, the Lakehouse isn't a product you buy — it's an architectural pattern. The principle is simple: put reliability in the metadata layer, leave flexibility and low cost in object storage.

Second, by 2025 this is no longer a cutting-edge experiment. It's the default starting point for most data engineering teams building new systems. The question isn't whether to use a Lakehouse — it's which table format fits your engine mix.

And third, the format war is softening. Delta UniForm and Iceberg's multi-engine catalog support mean you don't have to pick one and lock yourself in forever. The underlying files are the same Parquet format either way. The metadata layer is increasingly interoperable. That's genuinely good news for teams who don't want to bet everything on one vendor's ecosystem.

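🇹🇼 Chinese (English translation)

Data infrastructure has taken a detour over the past decade. Data warehouses dominated for a long time, then data lakes rose as a low-cost way to store large volumes of raw data, but their governance and query-performance problems gave engineers headaches for years. The Data Lakehouse, today's topic, is the architectural pattern that tries to end this tug-of-war.

First, a quick recap of how the three differ.

Data warehouses are structured and fast, with full ACID transactions, but they are expensive, and their strict schema definitions make changing data structures painful. The unstructured data that machine learning and AI need simply does not fit.

Data lakes go to the other extreme: low cost, any format accepted, lots of flexibility. But there are no ACID transactions, so concurrent writes can corrupt data; there is no schema validation, so data quality is hard to guarantee; and there is the classic problem of small files piling up, so query performance needs manual cleanup every so often.

What the Lakehouse aims to do is get the benefits of both on a single storage layer.

How, concretely? The core idea is to layer a transactional metadata layer on top of open-format object storage. The data itself lives in cheap object storage such as S3 or GCS, in an open format like Parquet, and an open table format such as Apache Iceberg or Delta Lake provides ACID semantics, time travel, and schema evolution on top.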

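Taking Apache Iceberg as an example, its metadata has three tiers. At the top are metadata files, which record the table schema and the full snapshot history. In the middle is the manifest list: each snapshot maps to one list recording which data files belong to it. At the bottom are manifest files, which record each Parquet file's path, row count, and statistics such as min and max values.

Before reading data, a query engine walks these three metadata tiers and uses the statistics to sharply reduce how much data it must scan. The design has another benefit: different engines can switch table versions atomically without a central lock manager.

Delta Lake's design is similar but more Spark-centric. It keeps a `_delta_log` directory at the table root that records transactions one by one as JSON, and every ten versions it produces a Parquet checkpoint to speed up loading.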

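The choice between Iceberg and Delta Lake is one of the most frequently asked questions of 2025. Iceberg originated at Netflix and was later donated to the Apache Software Foundation; its design focus is cross-engine interoperability, and Snowflake, Trino, BigQuery, and Flink all support it well. Delta Lake is led by Databricks and later joined the Linux Foundation; it has an edge in performance and ease of use within the Spark ecosystem, but it remains mostly tied to the Databricks track.

Notably, the two formats began converging in 2025. Delta Lake introduced a feature called UniForm that lets a Delta table also expose Iceberg-compatible metadata. In other words, you write with Delta Lake, and any Iceberg-capable engine can read the data, achieving the goal of “write once, read anywhere.”

So which one should you actually choose? A simple decision framework: if you work mainly inside the Databricks ecosystem, choose Delta Lake; if you need multiple engines to access the same data at once, for example Snowflake plus Spark plus Trino, choose Iceberg; if you want both, Delta UniForm is currently the most pragmatic option.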

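To wrap up, three key points.

First, the Lakehouse is not a software product but an architectural pattern: an open table format adds ACID capabilities on top of cheap object storage while preserving flexibility and low cost.

Second, Apache Iceberg and Delta Lake are the two mainstream paths today, and the choice depends mainly on your engine ecosystem rather than on technical superiority.

Third, the industry is moving toward format convergence, and interoperability mechanisms like UniForm keep shrinking the risk of “choosing the wrong format.” If you are designing a new data system in 2025, the Lakehouse architecture should be your default starting point, not a fallback.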
