Context
Northstar Retail runs an AWS-based analytics stack for e-commerce orders, payments, and customer support data. Today, analysts manually join raw tables in Snowflake because the current nightly ETL only copies source data without consistent modeling, causing duplicate metrics, broken dashboards, and slow backfills.
You are asked to design a maintainable data modeling and ETL solution that produces trusted analytics tables for finance, operations, and product teams.
Scale Requirements
- Sources: PostgreSQL orders DB, Stripe payment exports, Zendesk ticket API, and S3 CSV partner files
- Volume: 120M order records, 15M customers, 400GB new raw data/day
- Batch frequency: Hourly incremental loads for operational reporting; nightly full quality reconciliation
- Latency target: Source to analytics-ready tables in < 45 minutes for hourly runs
- Retention: 3 years hot storage in Snowflake, 7 years archived in S3
Requirements
- Design raw, staging, and curated layers for the warehouse, including fact and dimension models for orders, customers, payments, and support interactions.
- Build incremental ETL/ELT logic that handles inserts, updates, late-arriving records, and deduplication.
- Define surrogate keys, slowly changing dimensions where appropriate, and clear grain for each table.
- Orchestrate dependencies so source ingestion, transformations, and data quality checks run reliably every hour.
- Include data validation for schema drift, null business keys, duplicate order IDs, and reconciliation of order totals vs payment totals.
- Support backfills for a 90-day historical correction without breaking downstream dashboards.
- Provide monitoring, alerting, and failure recovery for pipeline and data quality issues.
Constraints
- Existing stack must remain on AWS + Snowflake; no new self-managed clusters
- Team size is 3 data engineers; solution should favor operational simplicity
- Budget increase capped at $15K/month
- Compliance: PII must be masked in non-production environments and access audited
- Downstream BI tools expect stable table names and backward-compatible schema evolution