System Design Mock: Breaking Down the DoorDash Donation Feature

The DoorDash donation feature system design question has appeared in interviews frequently enough to be documented across LeetCode Discuss and multiple interview sharing platforms. The premise seems simple — users can choose to donate at checkout — but the detail handling and scale discussion can go deep. This is a complete mock walkthrough with focus on the reasoning behind design decisions, not on memorizing the “right” answer.

TL;DR

The core problem of designing the DoorDash donation feature: with millions of users checking out simultaneously, each order potentially triggering a donation, how do you reliably record every donation, prevent double-counting, and provide an accurate (or near-accurate) real-time total? The answer is event-driven architecture + Redis counter + async reconciliation — not writing to the database and querying the count on every donation.

Design Philosophy

DoorDash’s real system is event-driven and built on Apache Kafka; its real-time event processing platform, Iguazu, handles hundreds of billions of events per day. This background explains why design thinking for a DoorDash problem naturally gravitates toward an event-driven approach: microservices are decoupled and communicate via Kafka topics, with downstream services subscribing to the event streams they need.

Core Concepts

Requirements Breakdown

Functional requirements:

Users can choose to donate at checkout (typically $1, $2, or a custom amount)
Display cumulative donation total for the campaign (e.g., “Campaign total: $1,234,567”)
Notify users when donation succeeds
Backend can query donation statistics for specific time ranges

Non-functional requirements:

Donation records cannot be lost (financial transaction reliability)
Donations cannot be double-counted (user clicks donate once; system retries can’t record it twice)
Cumulative total can tolerate second-level latency (no need for strong-consistency real-time update)
Peak times (dinner hours): on the order of hundreds of orders per second

Scale Estimation

In interviews, scale estimation isn’t about hitting exact numbers — it’s about confirming the design choices are appropriate for the right order of magnitude:

DoorDash daily order volume: approximately a few million (2024 data)
Assuming 30% of users donate: ~1M donations/day
Dinner peak (3-hour window) holds ~40% of daily donations: ~400K / 10,800 s ≈ 37/sec average; with 3–5x burstiness inside the window, ≈ 110–185/sec at peak
This volume is manageable for a single Postgres database, but the read/write contention on the cumulative counter is the problem
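
The arithmetic above can be reproduced in a few lines. This is only a back-of-envelope sketch: the one million donations per day, the 40% dinner share, and the 3–5x burst multipliers are the assumptions stated in the list, not measured values.

```python
# Back-of-envelope peak-rate estimate using the assumed inputs from the list above.
DONATIONS_PER_DAY = 1_000_000   # ~30% of a few million daily orders (assumption)
DINNER_SHARE = 0.40             # share of daily volume in the dinner window (assumption)
WINDOW_SECONDS = 3 * 60 * 60    # 3-hour dinner window
PEAK_MULTIPLIERS = (3, 5)       # assumed burstiness within the window

window_donations = DONATIONS_PER_DAY * DINNER_SHARE
avg_per_sec = window_donations / WINDOW_SECONDS
peak_low, peak_high = (avg_per_sec * m for m in PEAK_MULTIPLIERS)

print(f"window donations: {window_donations:,.0f}")
print(f"average rate:     {avg_per_sec:.0f}/sec")
print(f"peak rate:        {peak_low:.0f}-{peak_high:.0f}/sec")
```

Changing any single input by 2x still leaves the result in the hundreds per second, which is the point of the estimate: the write rate is modest, so the design effort should go into the counter rather than into sharding the donations table.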

System Architecture

graph TB
  subgraph "Checkout Flow"
    Client["Client\nCheckout + Donation Option"]
    OrderSvc["Order Service\nProcess Payment"]
    PaySvc["Payment Service\nCharge"]
  end

  subgraph "Donation Flow"
    DonSvc["Donation Service\nWrite Donation Record"]
    DonDB["Donation DB\nPostgres"]
    Kafka["Kafka\ndonate.created topic"]
  end

  subgraph "Aggregation Flow"
    Aggregator["Counter Aggregation Service\nConsume Kafka Events"]
    Redis["Redis\nDonation Counter"]
    Dashboard["Dashboard API\nRead Counter"]
  end

  subgraph "Notification Flow"
    NotifSvc["Notification Service\nSubscribe Kafka"]
    Push["Push Notification"]
  end

  Client --> OrderSvc
  OrderSvc --> PaySvc
  PaySvc -->|"Payment Success Callback"| DonSvc
  DonSvc --> DonDB
  DonSvc -->|"Publish Event"| Kafka
  Kafka --> Aggregator
  Aggregator --> Redis
  Redis --> Dashboard
  Kafka --> NotifSvc
  NotifSvc --> Push
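
The "Publish Event" edge in the diagram implies some event payload shared by the aggregator and the notification service. A minimal sketch of what that payload might look like follows; the class name and every field name are hypothetical, chosen only to carry what the two consumers need, and the bytes it produces could be handed to any Kafka producer.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DonationCreated:
    """Hypothetical payload for the donate.created topic."""
    donation_id: str   # primary key of the row in the donations table
    order_id: str
    campaign_id: str   # tells the aggregator which counter to bump
    user_id: str       # lets the notification service address the push
    amount_cents: int  # integer cents; floats are avoided for money
    created_at: str    # ISO 8601 timestamp

    def to_message(self) -> tuple[bytes, bytes]:
        """Return (key, value) bytes for a producer.

        Keying by donation_id spreads events across partitions; a sum
        does not depend on event order, so no per-campaign ordering is needed.
        """
        key = self.donation_id.encode("utf-8")
        value = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return key, value

    @classmethod
    def from_value(cls, value: bytes) -> DonationCreated:
        """Rebuild the event on the consumer side."""
        return cls(**json.loads(value))


event = DonationCreated(
    donation_id="d-1", order_id="o-1", campaign_id="2026q2",
    user_id="u-1", amount_cents=200, created_at="2026-04-30T18:30:00Z",
)
key, value = event.to_message()
decoded = DonationCreated.from_value(value)
```

Carrying donation_id in the payload is what makes consumer-side deduplication possible later in the design.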

Key Design Decisions

Decision 1: Strong Consistency for Donation Records

Donations are financial transactions. Requirements:

Record donation only on successful payment (can’t record first, charge later)
Donation record cannot be duplicated by system retries

Solution: Idempotency design

Use order_id + donation_attempt_id as a unique key per donation (or the payment system’s payment_reference_id):

CREATE TABLE donations (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  order_id TEXT NOT NULL,
  campaign_id TEXT NOT NULL,         -- needed for per-campaign totals
  payment_ref TEXT NOT NULL UNIQUE,  -- idempotency key: one row per payment
  amount_cents INTEGER NOT NULL CHECK (amount_cents > 0),
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

The UNIQUE constraint on payment_ref rejects a second insert for the same payment. Writing the insert as INSERT ... ON CONFLICT (payment_ref) DO NOTHING turns that rejection into a silent no-op instead of an error, so only one record exists per donation even if the payment callback fires multiple times.
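
A small runnable sketch of the idempotent insert follows. It uses Python's built-in sqlite3 as a stand-in for Postgres (an assumption made only so the example runs without a server; SQLite 3.24+ accepts the same ON CONFLICT ... DO NOTHING clause), and the function name record_donation is hypothetical.

```python
import sqlite3


def record_donation(conn, order_id: str, payment_ref: str, amount_cents: int) -> bool:
    """Insert a donation at most once per payment_ref.

    Returns True if a new row was written, False if the callback was a duplicate.
    """
    cur = conn.execute(
        "INSERT INTO donations (order_id, payment_ref, amount_cents) "
        "VALUES (?, ?, ?) ON CONFLICT (payment_ref) DO NOTHING",
        (order_id, payment_ref, amount_cents),
    )
    conn.commit()
    return cur.rowcount == 1


# In-memory database standing in for the Donation DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE donations ("
    " id INTEGER PRIMARY KEY,"
    " order_id TEXT NOT NULL,"
    " payment_ref TEXT NOT NULL UNIQUE,"
    " amount_cents INTEGER NOT NULL)"
)

first = record_donation(conn, "order-1", "pay-abc", 200)
retry = record_donation(conn, "order-1", "pay-abc", 200)  # duplicate callback
rows = conn.execute("SELECT COUNT(*) FROM donations").fetchone()[0]
```

Returning whether a row was actually written is useful in this design: the donation service can publish the Kafka event only when the insert was new, which reduces duplicate events at the source.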

Decision 2: Redis Counter for Aggregation, Not SQL SUM on Every Read

If every query for the donation total runs SELECT SUM(amount_cents) FROM donations WHERE campaign_id = 'xxx', that’s a heavy query on millions of records. The solution is maintaining a Redis counter:

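INCRBY campaign:2026q2:total_cents 200

Each time a new donation event is consumed from Kafka, the aggregator atomically adds the amount with INCRBY. Because amounts are stored as integer cents, the integer command avoids the floating-point rounding that INCRBYFLOAT could introduce. Redis executes each command atomically, and the increment is O(1).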

Challenge: what happens when Redis restarts?

Answer: Redis AOF or RDB persistence reduces the loss window. But even if the counter is lost, you can recalculate from the database’s SUM and backfill. The design accepts “displayed total may have second-to-minute lag but won’t be permanently inconsistent.”
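
The backfill step can be sketched as follows. To keep the example self-contained, sqlite3 stands in for Postgres and a plain dict stands in for Redis; both are assumptions for illustration, as are the campaign_id column and the function name rebuild_campaign_total. With a real Redis client the final assignment would be a SET on the same key.

```python
import sqlite3


def rebuild_campaign_total(conn, counters: dict, campaign_id: str) -> int:
    """Recompute a campaign total from the database and overwrite the cached counter."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM donations WHERE campaign_id = ?",
        (campaign_id,),
    ).fetchone()
    counters[f"campaign:{campaign_id}:total_cents"] = total
    return total


# Source of truth: three donations for one campaign, one for another.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE donations (campaign_id TEXT NOT NULL, amount_cents INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO donations VALUES (?, ?)",
    [("2026q2", 200), ("2026q2", 100), ("2026q2", 500), ("other", 999)],
)

counters = {}  # counter lost after a cache restart
rebuilt = rebuild_campaign_total(conn, counters, "2026q2")
empty = rebuild_campaign_total(conn, counters, "no-donations-yet")
```

In a live system the aggregator would need to be paused, or its offsets coordinated with the rebuild, so that events consumed during the SUM are neither dropped nor counted twice.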

Decision 3: Kafka Consumer At-Least-Once Semantics

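A Kafka consumer that commits offsets after processing gets at-least-once delivery: after a crash or rebalance, the same event may be consumed again. Without protection, the counter would be incremented more than once for the same donation.

Solution: make the consumer idempotent. Record each processed donation ID in Redis (or a DB) and skip events whose ID is already recorded. The sketch below uses one marker key per donation, written with SET NX and a 7-day TTL, so the check-and-mark step is atomic and each marker expires on its own; calling EXPIRE on a single shared set would instead reset the TTL for the whole set on every write.

import redis  # redis-py client

r = redis.Redis()
SEEN_TTL_SECONDS = 86400 * 7  # keep dedup markers for 7 days

def apply_donation(donation_id, campaign_id, amount_cents):
    # SET NX is atomic: only the first delivery of this donation_id creates the marker
    is_new = r.set(f'processed_donation:{donation_id}', 1, nx=True, ex=SEEN_TTL_SECONDS)
    if not is_new:
        return  # already processed, skip
    r.incrby(f'campaign:{campaign_id}:total_cents', amount_cents)

If the consumer crashes between writing the marker and incrementing, that donation is missing from the counter until the next reconciliation against the database, which is consistent with treating the database as the source of truth.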
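
To see the effect of redelivery without running Redis or Kafka, the sketch below replays a duplicated event against an in-memory stand-in. InMemoryStore and its two methods are hypothetical and only mimic a "set if absent" marker and an integer increment; the event dictionaries are made-up sample data.

```python
class InMemoryStore:
    """Tiny stand-in for the two cache operations the consumer needs."""

    def __init__(self):
        self.markers = set()
        self.counters = {}

    def set_if_absent(self, key: str) -> bool:
        """Return True only the first time a key is seen."""
        if key in self.markers:
            return False
        self.markers.add(key)
        return True

    def incrby(self, key: str, amount: int) -> int:
        self.counters[key] = self.counters.get(key, 0) + amount
        return self.counters[key]


def consume(store: InMemoryStore, events) -> None:
    """Apply each donation event to the counter at most once."""
    for event in events:
        marker = f"processed_donation:{event['donation_id']}"
        if not store.set_if_absent(marker):
            continue  # duplicate delivery, skip
        store.incrby(f"campaign:{event['campaign_id']}:total_cents", event["amount_cents"])


events = [
    {"donation_id": "d1", "campaign_id": "2026q2", "amount_cents": 200},
    {"donation_id": "d2", "campaign_id": "2026q2", "amount_cents": 100},
]
store = InMemoryStore()
consume(store, events)
consume(store, events[-1:])  # redelivery of d2 after a simulated consumer restart
total = store.counters["campaign:2026q2:total_cents"]
```

Removing the marker check makes the same replay add the second donation twice, which is exactly the double-counting the idempotency check is there to prevent.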

Comparison with Common Alternatives

Approach	Pros	Cons
SQL COUNT/SUM on every query	Strong consistency, simple to implement	Heavy DB load under concurrency, slow
Redis counter (event-driven)	Fast, scalable	Need to handle duplicate consumption, eventual consistency
DB + materialized view	Strong consistency, SQL queryable	High refresh cost when updates are frequent
2PC (distributed transaction)	Strong consistency	High complexity, poor performance, easily a bottleneck

How to Expand in an Interview

The depth in this question comes from “what do you choose, why, and at what cost”:

Scale up: If the campaign gets 30 million donations, is a Redis counter still sufficient? (Yes — INCR is O(1), Redis handles millions of operations per second)
Fraud detection: Same user makes 1,000 donations in a short time — how do you handle it? (Rate limiting + anomaly detection added at the donation service layer)
Precise final count when campaign ends: You need an exact final number, not an estimate (after closing Kafka consumption, run SQL SUM as final confirmation, overwrite Redis counter with DB value)
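
For the fraud follow-up, a per-user rate limit is the simplest first layer. The sketch below is a minimal in-process sliding-window limiter; the class name, the limit of 5 donations per 60 seconds, and the explicit now argument are illustrative assumptions. A production version would more likely keep this state in a shared store so that all donation service instances see the same counts.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_events per user within any window_seconds span."""

    def __init__(self, max_events: int, window_seconds: float):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps of accepted donations

    def allow(self, user_id: str, now: float) -> bool:
        timestamps = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_events:
            return False  # over the limit: reject or flag for review
        timestamps.append(now)
        return True


limiter = SlidingWindowLimiter(max_events=5, window_seconds=60)
# Seven attempts one second apart: the first five pass, the rest are rejected.
decisions = [limiter.allow("user-1", now=float(t)) for t in range(7)]
# After the oldest attempts age out, the user can donate again.
later = limiter.allow("user-1", now=61.0)
other_user = limiter.allow("user-2", now=3.0)
```

Passing the clock in as an argument keeps the limiter deterministic and easy to test; rejected attempts can also be fed into whatever anomaly detection sits behind the donation service.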

The Bottom Line

The value of the DoorDash donation feature question isn’t in the answer. It’s that one problem covers the most common distributed systems design themes: idempotency, counter aggregation, eventual vs. strong consistency, and event-driven decoupling. A good mock first clarifies requirements and scale, then proposes a few approaches and discusses their trade-offs, rather than reciting a “standard answer.”

🇺🇸 English

Here's the pitch — a feature that looks dead simple from the outside turns into one of the most instructive distributed systems problems you'll encounter in an interview: the DoorDash checkout donation feature.

You've seen it. You're about to place your order, and there's a small prompt — want to round up and donate a dollar or two? Feels trivial. But now imagine millions of users hitting that button simultaneously during the dinner rush, every donation needing to be recorded exactly once, and somewhere on a campaign page, a live running total that needs to feel real-time. Suddenly this is anything but simple.

Let's break it down.

---

The first thing you want to do in any system design interview is get the requirements crisp. On the functional side: users should be able to donate at checkout, the system should show a cumulative campaign total — something like "Campaign total: one point two million dollars" — users should get a notification when their donation goes through, and the backend needs to support queries on donation stats over time.

On the non-functional side, here's where it gets interesting. Donation records cannot be lost — this is a financial transaction, full stop. Donations cannot be double-counted — if a user clicks donate once, the system should record it once, even if retries happen under the hood. The cumulative total, though? That can tolerate a few seconds of lag. Nobody's going to notice if the displayed total is two seconds behind reality. And at peak — dinner hours, maybe a three-hour window — you're looking at potentially hundreds of orders per second.

That last point is worth sitting with. DoorDash processes something in the range of a few million orders per day. Assume thirty percent of users donate — that's about a million donations a day. Concentrate forty percent of that into a three-hour dinner window, and you're looking at around a hundred to a hundred eighty donations per second at peak. That's honestly manageable for a single Postgres database. But here's the trap: the problem isn't writing individual donation records. It's maintaining an accurate running total under that kind of concurrent write pressure.

---

So let's talk architecture. DoorDash's real infrastructure is built around Apache Kafka — their internal event pipeline processes hundreds of billions of events per day. That context matters, because it explains why the natural answer here is event-driven.

Here's how the flow works in plain language. When a user checks out, the order service coordinates with the payment service to actually charge the card. Only after payment succeeds does the donation service get involved. That's critical — you never record a donation before the money clears. The donation service writes the record to a Postgres database, then publishes an event to a Kafka topic. From there, two things happen in parallel downstream: a counter aggregation service consumes those events and updates a Redis counter, and a notification service consumes the same events to send the user a push notification. The dashboard showing the live campaign total just reads from Redis.

That separation is the key insight. The heavy lifting — accumulating the total — is decoupled from the checkout flow entirely. The user's checkout experience doesn't wait on counter updates. It just fires an event and moves on.

---

Now let's talk about the three design decisions that make or break this system.

**First: making donation records bulletproof.**

The payment service might fire the same callback multiple times. Network hiccups, retries — this is normal distributed systems behavior. If you naively write to the database on every callback, you'll end up with duplicate donations. The fix is idempotency. You use the payment system's unique reference ID as a unique constraint in your donations table. If the same payment fires twice, the second insert simply fails silently with a "do nothing on conflict" — one record, guaranteed, no matter how many times the callback arrives.

**Second: Redis for the running total, not a database query.**

The naive approach is to query your donations table with a sum aggregation every time someone loads the campaign page. That works fine at small scale. At millions of records under heavy load, it falls apart fast — it's an expensive scan every single time. Instead, you maintain a Redis counter. Every time the Kafka consumer processes a new donation event, it increments the counter atomically. Redis's atomic increment operation runs in constant time regardless of how large the counter gets. The display layer just reads one key from Redis — done.

The obvious follow-up question: what happens if Redis goes down and you lose the counter? You rebuild it. Redis supports persistence modes that minimize the loss window, but even in a worst case, you can recalculate the true total from your Postgres database and backfill the Redis key. The system accepts eventual consistency on the display; the source of truth is always the database.

**Third: handling Kafka's at-least-once delivery guarantee.**

Kafka guarantees your consumer will see every event — but it might see some events more than once. If your counter aggregation service crashes and restarts mid-batch, it'll re-consume events it already processed. Without protection, you'd double-count those donations in Redis. The fix is tracking which event IDs you've already processed. Before incrementing the counter, you check a Redis set of processed donation IDs. If the ID's already there, you skip it. If it's new, you increment and add the ID to the set. You clean up that set after a week or so to keep memory bounded.

---

Let me quickly walk through the alternatives and why they get ruled out.

You could skip Redis entirely and just run a sum query on every page load. Simple, strongly consistent — but it hammers the database under load. You could use a database materialized view that pre-aggregates the total — more consistent, but expensive to keep fresh when thousands of donations are hitting per minute. You could go full distributed transaction with two-phase commit to guarantee every write is atomic across services — technically airtight, but the complexity and performance cost make it a serious bottleneck at scale. Event-driven with Redis wins on the tradeoff matrix: fast, scalable, and resilient enough for the consistency requirements here.

---

If an interviewer wants to push deeper, here's where the conversation can go. What if the campaign goes viral and you get thirty million donations? Redis still works — an atomic increment is constant time, Redis handles millions of operations per second without breaking a sweat. What about fraud — someone making a thousand small donations rapidly? You add rate limiting and anomaly detection at the donation service layer before the record ever hits the database. What about the final official campaign total when the campaign closes — do you trust the Redis counter? No. You stop consuming new events, run a verified sum from the database, and use that as the canonical final number, overwriting the Redis key with the confirmed value.

---

So here's what to take away from this.

One: financial records and display counters have different consistency requirements — and your architecture should reflect that. Write to the database for the permanent record; use a fast cache for display. These don't have to be the same operation.

Two: idempotency isn't an afterthought. In any system where retries can happen — and they always can — you need a strategy for making repeated operations safe. Unique constraints and idempotency keys are your first line of defense.

Three: the DoorDash donation question is worth studying not because it's tricky, but because it packs idempotency, counter aggregation, eventual versus strong consistency, and event-driven decoupling into one compact problem. Master the reasoning here and you've got a mental model that transfers to a dozen other system design scenarios.

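🇹🇼 Chinese (translated)

The DoorDash donation campaign system design is an interview question that looks simple but makes people sweat once you dig into the details. At checkout, users can choose to donate a dollar or two, and the screen shows how much the campaign has raised so far. The problems behind this feature are very typical.

The core problem is three things happening at once: high-concurrency writes, financial reliability, and a counter that has to update in near real time. Think these three dimensions through first and the whole design has a direction.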

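Start with scale. DoorDash handles a few million orders a day; assuming 30% of users choose to donate, that is roughly one million donations a day. The dinner peak concentrates 40% of that volume, which works out to an average of roughly 30 to 40 donations per second, with peaks above 100. That order of magnitude is not extreme: a single Postgres database can handle it. The problem is not the writes but the read/write contention on the counter.

DoorDash's real architecture is a Kafka-based event-driven system, with a core platform called Iguazu that processes hundreds of billions of events per day. This background matters because it explains why the "standard solution" to this question is event-driven rather than synchronous writes.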

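The overall flow looks like this: the user checks out, the order service processes the payment, and once payment succeeds a callback triggers the donation service, which writes the donation to the donation database and then publishes an event to Kafka. Two consumers sit downstream: an aggregation service that updates the counter in Redis, and a notification service that sends a push telling the user "your donation succeeded." The total shown on the dashboard is read directly from Redis.

Now for the three most critical design decisions.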

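**First: idempotency of donation records.**

A donation is a financial transaction, and there is one iron rule: record it only after payment succeeds, and never count it twice. The payment callback may fire multiple times because of network problems, which is normal. The fix is a unique constraint on the payment reference ID column of the donations table. Duplicate inserts fail silently, and the database ends up with only one record. The design is not complicated, but it matters: it stops duplicate processing at the data layer.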

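**Second: use Redis for the counter, not SQL.**

If every view of the donation total runs a full-table sum over the donations table, that query is heavy at millions of records, and traffic to the donation page may be far higher than the donation rate itself. Redis's `INCR` operation is atomic, runs in O(1) time, and can handle millions of operations per second. Each time a new donation event is consumed from Kafka, the counter is incremented atomically, and the dashboard reads that value directly.

Someone will ask: what if Redis restarts? The answer: enable AOF persistence to shrink the loss window, and even if the counter really is lost, recompute it from the database and backfill. The design accepts that the displayed total may lag by seconds, but it does not accept permanent inconsistency.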

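**Third: duplicate processing in Kafka consumption.**

Kafka guarantees at-least-once delivery, which means the same event may be consumed twice. If every consumed event is added straight to the counter, the counter gets incremented more than once. The fix is to maintain a set of processed events on the consumer side, also stored in Redis. Before processing, check whether the event ID has been handled; only if it has not, add the amount and then record the ID in the set. A seven-day expiry on that set is enough, since old records do not need to be kept forever.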

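Compare this with a few alternatives. Running a SQL sum directly gives strong consistency and is simple to implement, but under high concurrency it puts heavy load on the database and the query is slow. A materialized view eases the query load, but the refresh cost is high when updates are frequent. Two-phase commit achieves strong consistency, but it is complex, performs poorly, and easily becomes a bottleneck. The cost of event-driven plus a Redis counter is having to handle duplicate consumption; what you get in return is scalability and performance.

In an interview, there are a few more directions worth raising proactively. When the campaign ends and you need an exact final number, what do you do? After stopping Kafka consumption, run one SQL sum as the final confirmation and overwrite the Redis counter with the database value. If the campaign explodes to 30 million donations, is the Redis counter still enough? Yes: INCR is O(1), and this is not where Redis bottlenecks. What if one account donates a thousand times in a short period? Add rate limiting at the donation service layer; this falls under fraud detection.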

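That is where the value of this question lies. It strings together the distributed systems problems that come up most often: idempotency design, counter aggregation, the trade-offs of eventual consistency, and event-driven decoupling. The point of the interview is never to recite an answer, but whether you can explain why you chose this approach, what it costs, and under which conditions you would choose something else.

Three core takeaways. First, deduplication for financial operations should be guaranteed at the data layer with a unique constraint, not by application-layer logic. Second, a counter that is read at high frequency should not hit the database; Redis atomic operations are the right tool. Third, event-driven architecture is a natural fit for this scenario, but at-least-once semantics mean the consumer has to handle idempotency itself.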
