You are designing a real-time sales analytics pipeline for a large commerce platform that serves millions of merchants. Today, merchant dashboards and internal finance reports rely on hourly and daily jobs, which creates visible mismatches during flash sales, payment retries, refunds, and order edits. Leadership wants sales metrics to update within minutes while preserving correctness for financial reporting and merchant-facing dashboards. The main pain point is that the same order can change state multiple times, and current pipelines double-count or miss adjustments under peak load.
| Component | Status |
|---|---|
| Event Sources | Checkout, orders, payments, refunds, and app webhooks emitted as JSON events |
| Ingestion | Kafka topics for operational events; some legacy hourly object storage drops |
| Processing | Batch Spark jobs plus limited stream consumers for operational alerts |
| Storage | Data lake in Parquet and warehouse marts in Snowflake |
| Orchestration | Apache Airflow 2.x for batch DAGs and dbt runs |
Scale: 8M+ merchants, 150K events/sec average and 600K peak during major sales periods, 1-3 KB/event, multi-region writes, <2 minute freshness target for dashboard aggregates, and 7-year retention for financial facts.
How would you design the end-to-end pipeline so merchant sales analytics remain near real time, financially correct, and resilient to duplicates, late events, replays, and schema evolution at this scale? Explain the architecture, storage model, processing strategy, and operational approach you would use.