Context
MediCart, a mid-sized e-commerce marketplace, loads order, payment, inventory, and shipment data into Snowflake from PostgreSQL, Stripe webhooks, and CSV files supplied by 40 logistics partners. The current Airflow-based nightly ETL frequently produces incomplete fact tables because upstream files arrive late, webhook events are duplicated, and some source fields are null or missing.
The analytics team needs a redesigned pipeline that can detect missing or incomplete data early, prevent bad loads from reaching reporting tables, and support safe backfills without double-counting revenue or orders.
Scale Requirements
- Batch volume: 120M order-related records/day
- File ingestion: 8,000 partner CSV files/day, 5–500 MB each
- Streaming side input: 15K payment/shipment events/sec at peak
- Latency target: curated tables available by 6:00 AM UTC; critical payment completeness checks within 10 minutes
- Storage: 60 TB historical raw data retained for 1 year
Requirements
- Design an ingestion and transformation pipeline that combines batch files, CDC from PostgreSQL, and event streams into Snowflake.
- Detect missing records, null critical fields, late-arriving files, and partial loads before publishing curated tables (see the completeness-gate sketch after this list).
- Define how to quarantine bad data, replay failed partitions, and backfill missing days safely (see the partition-replay sketch after this list).
- Ensure idempotent processing for duplicate webhook events and repeated file deliveries (see the MERGE sketch after this list).
- Implement data quality checks for row-count reconciliation, freshness, schema validation, and referential integrity.
- Describe monitoring, alerting, and operational ownership for incomplete-data incidents (see the alerting sketch after this list).
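
To illustrate the detection and quality-check requirements (row-count reconciliation, null critical fields, late or missing partner files), one option is a completeness gate that runs as the final task before any curated table is published and fails the run when a check trips. This is a sketch only: the table names (staging.stg_orders, audit.extract_counts, audit.file_registry), the column names, and the 0.001 reconciliation tolerance are assumptions, and it uses the snowflake-connector-python client against the existing Snowflake account.

```python
from datetime import date

import snowflake.connector

# Each check returns a single number; 0 means the check passes. All object
# names and the 0.001 tolerance are assumptions, not an agreed schema.
CHECKS = {
    "null_critical_fields": """
        SELECT COUNT(*)
        FROM staging.stg_orders
        WHERE load_date = %(ds)s
          AND (order_id IS NULL OR order_total IS NULL OR currency IS NULL)
    """,
    "row_count_reconciliation": """
        -- expected_rows is captured per source at extract time; the loaded
        -- count must match within a tolerance of 0.001.
        SELECT COUNT(*)
        FROM (
            SELECT a.source, a.expected_rows, COUNT(s.order_id) AS loaded_rows
            FROM audit.extract_counts a
            LEFT JOIN staging.stg_orders s
              ON s.load_date = a.load_date
             AND s.source    = a.source
            WHERE a.load_date = %(ds)s
            GROUP BY a.source, a.expected_rows
            HAVING ABS(COUNT(s.order_id) - a.expected_rows)
                   > a.expected_rows * 0.001
        ) AS mismatches
    """,
    "partner_files_complete": """
        -- All 40 logistics partners must have at least one loaded file.
        SELECT 40 - COUNT(DISTINCT partner_id)
        FROM audit.file_registry
        WHERE load_date = %(ds)s
          AND status = 'LOADED'
    """,
}


def run_completeness_gate(conn_params: dict, ds: date) -> None:
    """Raise if any check fails, so the downstream publish task never runs."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        failures = []
        for name, sql in CHECKS.items():
            cur.execute(sql, {"ds": ds})
            violations = cur.fetchone()[0]
            if violations > 0:
                failures.append(f"{name}: {violations} violation(s)")
        if failures:
            raise ValueError("completeness gate failed: " + "; ".join(failures))
    finally:
        conn.close()
```

Freshness and schema checks fit the same shape: each is simply another entry in CHECKS whose query returns a violation count.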
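
For quarantine, replay, and backfill, one pattern that keeps reruns reproducible (which also helps the SOX requirement) is to treat each load date as a replaceable partition: bad rows are diverted to a reject table, and a failed or corrected day is deleted and reloaded inside a single transaction so reruns never double-count. The object and column names below are again hypothetical:

```python
from datetime import date

import snowflake.connector

# Hypothetical schema: both tables carry the same order_date partition key, so
# one day can be deleted and reloaded atomically without touching other days.
DELETE_SQL = "DELETE FROM analytics.fct_orders WHERE order_date = %(ds)s"
INSERT_SQL = """
    INSERT INTO analytics.fct_orders (order_id, order_date, customer_id, order_total)
    SELECT order_id, order_date, customer_id, order_total
    FROM staging.stg_orders
    WHERE order_date = %(ds)s
"""


def replay_partition(conn_params: dict, ds: date) -> None:
    """Replace one day's facts in a single transaction; rerunning is harmless."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        cur.execute("BEGIN")
        try:
            cur.execute(DELETE_SQL, {"ds": ds})
            cur.execute(INSERT_SQL, {"ds": ds})
            cur.execute("COMMIT")
        except Exception:
            cur.execute("ROLLBACK")
            raise
    finally:
        conn.close()
```

A backfill of several missing days is then just a loop (or an Airflow backfill run) that calls replay_partition once per date.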
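
Idempotency for duplicate Stripe webhook deliveries and re-sent partner files can be achieved by landing everything append-only in staging and merging into curated tables on a stable key; for payment events the Stripe event id is a natural choice. A minimal sketch with hypothetical table and column names:

```python
import snowflake.connector

# Hypothetical staging/target tables; event_id is the Stripe event identifier,
# which is stable across webhook redeliveries.
MERGE_SQL = """
MERGE INTO analytics.fct_payments AS tgt
USING (
    -- Collapse duplicate deliveries inside the batch first: keep the latest
    -- copy of each event_id.
    SELECT *
    FROM staging.stg_payment_events
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY event_id ORDER BY received_at DESC
    ) = 1
) AS src
ON tgt.event_id = src.event_id
WHEN MATCHED THEN UPDATE SET
    tgt.amount_cents = src.amount_cents,
    tgt.status       = src.status,
    tgt.updated_at   = src.received_at
WHEN NOT MATCHED THEN INSERT
    (event_id, order_id, amount_cents, status, updated_at)
VALUES
    (src.event_id, src.order_id, src.amount_cents, src.status, src.received_at)
"""


def merge_payment_events(conn_params: dict) -> int:
    """Idempotent upsert: re-running the same batch inserts nothing new."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        cur.execute(MERGE_SQL)
        return cur.rowcount  # rows inserted or updated in this run
    finally:
        conn.close()
```

Re-delivered partner files can be handled the same way if each staged row carries its source file name (or hash) plus a row key.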
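
For monitoring and ownership, a lightweight starting point is to attach a failure callback and an SLA to the publish DAG so incomplete-data incidents reach the on-call engineer well before the 06:00 UTC target. The sketch assumes Airflow 2.x; the DAG id, owner, and notification hook are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_oncall(context):
    """Placeholder alert hook: swap in Slack/PagerDuty for real incidents."""
    ti = context["task_instance"]
    print(f"ALERT: {ti.dag_id}.{ti.task_id} failed for {context['ds']}")


def completeness_gate(**_):
    """Wire this to run_completeness_gate() from the earlier sketch."""


with DAG(
    dag_id="curated_orders_publish",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",          # nightly, ahead of the 06:00 UTC target
    catchup=False,
    default_args={
        "owner": "data-eng-oncall",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "on_failure_callback": notify_oncall,
        "sla": timedelta(hours=3),  # alert well before the availability target
    },
) as dag:
    PythonOperator(task_id="completeness_gate", python_callable=completeness_gate)
```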
Constraints
- Existing stack is AWS, Snowflake, Airflow, and dbt; avoid introducing more than one major new platform.
- Budget increase is capped at $15K/month.
- SOX compliance requires auditable lineage and reproducible reruns.
- Downstream finance dashboards must not show partially complete daily revenue tables (see the publish-swap sketch after this list).
- Team size: 3 data engineers and 1 analytics engineer.
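
The no-partial-revenue constraint pairs naturally with a write-audit-publish flow: rebuild the day's revenue into a shadow table, run the completeness gate from the earlier sketch, and only then swap the shadow table into the dashboard-facing name (an atomic operation in Snowflake). The table names here are hypothetical:

```python
from datetime import date

import snowflake.connector


def publish_daily_revenue(conn_params: dict, ds: date) -> None:
    """Write-audit-publish: the dashboard-facing table is replaced atomically,
    and only after the gate passes, so finance never sees a half-loaded day."""
    # run_completeness_gate() is the function from the earlier sketch; it
    # raises on any failed check, which stops the swap below from running.
    run_completeness_gate(conn_params, ds)
    conn = snowflake.connector.connect(**conn_params)
    try:
        conn.cursor().execute(
            "ALTER TABLE analytics.fct_daily_revenue "
            "SWAP WITH analytics.fct_daily_revenue_build"
        )
    finally:
        conn.close()
```

Because the swap either happens completely or not at all, a failed run leaves the previous fully loaded day visible to the dashboards rather than a partial one.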