You are designing a change data capture pipeline for an e-commerce platform that needs operational data from MySQL available in its warehouse with sub-second latency. Today, analysts and finance teams see inventory, orders, and refunds several minutes late, and recent executive escalations exposed mismatches between application state and reporting tables during peak traffic. The platform wants a streaming-first design that preserves row-level changes, supports schema evolution, and can recover cleanly from failures without double-applying updates.
| Component | Status / Technology |
|---|---|
| Source OLTP | MySQL 8.0 primary + read replicas |
| CDC Capture | Nightly dumps plus hourly incremental queries |
| Transport | No durable event bus |
| Warehouse | Snowflake with raw and modeled layers |
| Transformations | dbt scheduled jobs |
| Orchestration | Shopify Flow for operational triggers; cron-based data jobs |
Scale: ~25K row changes/sec peak, ~4K avg across orders, order_lines, inventory_levels, refunds, and customers; 1.5 TB source data; 300+ tables, 40 business-critical; target freshness P95 < 1 second to raw warehouse tables; 7-day replay window.
How would you architect this CDC pipeline end to end so that MySQL changes land in warehouse tables with sub-second latency while preserving ordering, idempotency, and correctness under schema changes, backfills, and partial outages? Explain the design choices you would make for capture, transport, processing, warehouse loading, and operational monitoring.