Context
Northstar Retail aggregates operational data from 4,200 stores and 180 retail partners across North America and Europe. Today, ERP exports, retailer POS files, and regional inventory feeds land in S3 as inconsistent CSV/JSON files, and analysts spend hours reconciling region, retailer, and calendar definitions before reporting.
You need to design a pipeline and data model that makes operational data easy to analyze by region, retailer, and time period while supporting both daily executive dashboards and ad hoc warehouse queries.
Scale Requirements
- Sources: 35 ERP tables, 180 retailer POS feeds, 12 regional inventory systems
- Volume: ~220M operational records/day, ~450 GB raw/day
- Ingestion cadence: hourly for POS/inventory, daily for ERP snapshots
- Latency target: data queryable in Snowflake within 30 minutes of file arrival
- Retention: 2 years hot in warehouse, 7 years archived in S3
- Concurrency: 250 BI users, 40 scheduled dashboard refreshes/hour
Requirements
- Design an ingestion and ELT pipeline to load raw operational data from S3 into Snowflake.
- Define a dimensional model or equivalent analytics model that supports slicing by region, retailer, and day/week/month/quarter.
- Standardize source-specific codes into conformed dimensions (e.g., retailer IDs, regional hierarchies, fiscal calendar).
- Support late-arriving corrections, reprocessing, and historical backfills without duplicating facts.
- Implement data quality checks for schema drift, null business keys, duplicate records, and invalid region-retailer mappings.
- Orchestrate hourly and daily dependencies with clear lineage and recovery steps.
- Expose analytics-ready tables for Tableau/Looker with predictable query performance.
Constraints
- AWS is the required cloud; existing landing zone is Amazon S3.
- Team size is 3 data engineers and 1 analytics engineer; operational simplicity matters.
- Budget target is under $30K/month incremental spend.
- Must support SOX-style auditability: immutable raw zone, transformation lineage, and reproducible historical loads.