Context
ShopWave, a mid-sized e-commerce marketplace, currently lands order, payment, shipment, and customer data from PostgreSQL and SaaS tools into Snowflake using basic table copies. Analysts query raw replicated tables directly, which causes inconsistent business definitions for metrics like gross merchandise value, net revenue, and fulfilled orders.
You need to design a production-grade ELT pipeline and target data model that supports reliable analytics and downstream dashboards. The focus is not only moving data, but choosing an appropriate warehouse data model, handling slowly changing business entities, and ensuring data quality as source schemas evolve.
Scale Requirements
- Sources: PostgreSQL OLTP, Stripe, Shopify, Zendesk
- Volume: 120M order line items/year, 15M customers, 300GB new raw data/day
- Load cadence: CDC from PostgreSQL every 5 minutes; SaaS extracts hourly
- Latency target: Analytics-ready marts available within 15 minutes of source updates
- Retention: 3 years hot in Snowflake, 7 years archived in S3
- Concurrency: 150 BI users, 40 scheduled dashboard refreshes/hour
Requirements
- Design a warehouse data model for orders, customers, payments, and shipments that supports finance and operations reporting.
- Define how raw ingestion, staging, intermediate transformations, and marts should be separated.
- Handle late-arriving updates, canceled orders, refunds, and customer attribute changes.
- Implement idempotent incremental ELT with backfill support for historical reprocessing.
- Add data quality checks for primary keys, referential integrity, freshness, and metric reconciliation.
- Orchestrate dependencies so marts only run after upstream loads and validation succeed.
- Provide monitoring, alerting, and failure recovery for source outages and schema drift.
Constraints
- Existing stack is AWS + Snowflake; avoid introducing a large new streaming platform.
- Team size is 3 data engineers and 6 analysts.
- Budget allows moderate Snowflake growth, but transformations must remain cost-aware.
- PCI-sensitive payment fields must not be exposed in analytics models; only tokenized identifiers may be stored.