Context
Northstar Retail operates an online marketplace and currently loads application, order, and customer-support data into Snowflake using ad hoc Python ETL jobs. Reporting teams see inconsistent metrics because source-aligned tables are not modeled consistently, and the company wants a standardized ELT pipeline with clear dimensional models for finance, product, and growth analytics.
You are asked to design a data pipeline and warehouse modeling approach that supports both operational source ingestion and analytics-ready data marts.
Scale Requirements
- Sources: PostgreSQL OLTP, Stripe, Zendesk, and S3 CSV partner feeds
- Volume: 250M order line records, 40M customers, 1.2B clickstream events/year
- Daily ingest: ~800 GB/day of raw data
- Freshness: Core business tables available in Snowflake within 15 minutes; clickstream aggregates within 5 minutes
- Retention: 3 years queryable history, 7 years archived raw data
- Concurrency: 150 BI users, 30 scheduled dashboard refreshes/hour
Requirements
- Design an ELT architecture that lands raw data, standardizes schemas, and builds analytics models in Snowflake.
- Explain which data modeling techniques you would use (for example: star schema, slowly changing dimensions, fact tables, snapshots, data vault, or normalized staging) and where each fits.
- Build trusted models for key entities: customers, orders, order_items, products, and support_tickets.
- Support incremental loads, late-arriving updates, and backfills without duplicating records.
- Define data quality checks for primary keys, referential integrity, freshness, and metric consistency.
- Orchestrate dependencies so raw ingestion, staging, and marts run reliably with clear lineage.
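The incremental-load requirement above (idempotent reruns, late-arriving updates, backfills without duplicates) usually reduces to a merge keyed on the natural key that keeps the newest row per version timestamp. A minimal Python sketch of that semantics, with the table modeled as a dict; field names like `order_id` and `updated_at` are illustrative, not mandated by this spec:

```python
def merge_increment(target, batch, key="order_id", version_col="updated_at"):
    """Idempotently upsert a batch into a target table (modeled as a dict
    keyed by primary key). An incoming row wins only if its version
    timestamp is at least as new as the stored row, so replaying a batch
    or loading a backfill can never duplicate records or overwrite newer
    data with a late-arriving older update."""
    for row in batch:
        existing = target.get(row[key])
        if existing is None or row[version_col] >= existing[version_col]:
            target[row[key]] = row
    return target
```

In Snowflake the same invariant maps to a `MERGE` statement (or a dbt incremental model with a `unique_key`): however many times a batch is replayed, exactly one row survives per key.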
Constraints
- Existing stack is AWS + Snowflake; avoid introducing more than one major new platform.
- Team size is 3 data engineers and 2 analytics engineers.
- PCI-related payment fields must be masked before analyst access.
- Monthly incremental infrastructure budget is capped at $18K.
- The design must be understandable by analysts who will extend dbt models.
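For the PCI masking constraint, Snowflake's dynamic masking policies are one native option; whatever mechanism is chosen, the masking rule itself is simple to state. A sketch in Python, assuming a keep-last-four policy is acceptable (the function name and retention choice are illustrative, not part of the spec):

```python
def mask_card_number(pan, keep_last=4):
    """Redact a payment card number before analyst-facing schemas,
    keeping only the trailing digits. Separators are stripped so the
    masked output has a uniform shape regardless of input formatting."""
    digits = "".join(ch for ch in pan if ch.isdigit())
    if len(digits) <= keep_last:
        raise ValueError("card number too short to mask safely")
    return "*" * (len(digits) - keep_last) + digits[-keep_last:]
```

The key design point is that masking happens in the transformation layer (or via a warehouse policy) before any analyst-accessible table, so raw PCI fields never appear in the marts analysts extend.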