Build Real-Time Sales Analytics Pipeline

Scenario

You are designing a real-time sales analytics pipeline for a large commerce platform that serves millions of merchants. Today, merchant dashboards and internal finance reports rely on hourly and daily jobs, which creates visible mismatches during flash sales, payment retries, refunds, and order edits. Leadership wants sales metrics to update within minutes while preserving correctness for financial reporting and merchant-facing dashboards. The main pain point is that the same order can change state multiple times, and current pipelines double-count or miss adjustments under peak load.

Current State

Component	Status
Event Sources	Checkout, orders, payments, refunds, and app webhooks emitted as JSON events
Ingestion	Kafka topics for operational events; some legacy hourly object storage drops
Processing	Batch Spark jobs plus limited stream consumers for operational alerts
Storage	Data lake in Parquet and warehouse marts in Snowflake
Orchestration	Apache Airflow 2.x for batch DAGs and dbt runs

Scale: 8M+ merchants, 150K events/sec average and 600K peak during major sales periods, 1-3 KB/event, multi-region writes, <2 minute freshness target for dashboard aggregates, and 7-year retention for financial facts.

Question

How would you design the end-to-end pipeline so merchant sales analytics remain near real time, financially correct, and resilient to duplicates, late events, replays, and schema evolution at this scale? Explain the architecture, storage model, processing strategy, and operational approach you would use.

Scenario

Current State

Component	Status
Event Sources	Checkout, orders, payments, refunds, and app webhooks emitted as JSON events
Ingestion	Kafka topics for operational events; some legacy hourly object storage drops
Processing	Batch Spark jobs plus limited stream consumers for operational alerts
Storage	Data lake in Parquet and warehouse marts in Snowflake
Orchestration	Apache Airflow 2.x for batch DAGs and dbt runs

Question

Scenario

Current State

Component	Status
Event Sources	Checkout, orders, payments, refunds, and app webhooks emitted as JSON events
Ingestion	Kafka topics for operational events; some legacy hourly object storage drops
Processing	Batch Spark jobs plus limited stream consumers for operational alerts
Storage	Data lake in Parquet and warehouse marts in Snowflake
Orchestration	Apache Airflow 2.x for batch DAGs and dbt runs

Question

Scenario

Current State

Component	Status
Event Sources	Checkout, orders, payments, refunds, and app webhooks emitted as JSON events
Ingestion	Kafka topics for operational events; some legacy hourly object storage drops
Processing	Batch Spark jobs plus limited stream consumers for operational alerts
Storage	Data lake in Parquet and warehouse marts in Snowflake
Orchestration	Apache Airflow 2.x for batch DAGs and dbt runs

Interview Guides

Scenario

Current State

Question

Build Real-Time Sales Analytics Pipeline

Scenario

Current State

Question

Your Answer

Build Real-Time Sales Analytics Pipeline

Scenario

Current State

Question

Build Real-Time Sales Analytics Pipeline

Scenario

Current State

Question

Your Answer