Context
Northstar Retail receives customer and order data from its e-commerce app, CRM, and payment platform. Today, analysts manually clean CSV extracts before loading them into Snowflake, a process that produces inconsistent metrics, duplicate records, and one-day reporting delays.
You need to design a production-grade batch pipeline that standardizes, validates, and prepares this data for downstream analysis. The company wants a repeatable process that improves data quality while remaining simple enough for a small data team to operate.
Scale Requirements
- Sources: Shopify exports, Salesforce CRM dumps, Stripe transaction files, and internal PostgreSQL tables
- Volume: ~120 GB/day raw data, ~45 million records/day across all sources
- Batch frequency: Hourly ingestion, with support for daily backfills covering up to 2 years of history
- Latency target: Data available in analytics tables within 30 minutes of file arrival
- Retention: Raw data for 1 year, curated tables for 5 years
- Data quality target: <0.5% invalid records in curated tables
Requirements
- Design an ingestion and cleaning pipeline for structured CSV/JSON data from multiple systems.
- Standardize schemas, data types, timestamps, currencies, and null handling across sources (see the standardization sketch after this list).
- Detect and handle duplicates, malformed rows, missing required fields, and referential integrity issues.
- Create raw, cleaned, and analytics-ready layers in Snowflake.
- Support idempotent reruns and historical backfills without creating duplicate outputs (see the MERGE sketch after this list).
- Orchestrate hourly loads with dependency management and quality checks.
- Provide monitoring for freshness, row-count anomalies, and failed validations (see the checks sketch after this list).
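One way to read the standardization requirement is a single set of per-row rules applied to every source before anything reaches Snowflake. A minimal sketch in Python, assuming illustrative column names, null markers, and a static currency table; none of these reflect Northstar's actual schemas or rates:

```python
# Minimal standardization sketch; NULL_TOKENS, USD_RATES, and the order
# column names are illustrative assumptions, not Northstar's real schema.
from datetime import datetime, timezone
from decimal import Decimal

NULL_TOKENS = {"", "NULL", "null", "N/A", "n/a", "-"}        # assumed null markers
USD_RATES = {"USD": Decimal("1.0"), "EUR": Decimal("1.08")}  # illustrative static rates

def normalize_nulls(value):
    """Map the assorted null markers used by the sources to a real None."""
    return None if value is None or str(value).strip() in NULL_TOKENS else value

def to_utc(ts_string, fmt="%Y-%m-%dT%H:%M:%S%z"):
    """Parse a source timestamp (offset required by the format) and convert to UTC."""
    return datetime.strptime(ts_string, fmt).astimezone(timezone.utc)

def to_usd(amount, currency):
    """Convert an amount to USD using the illustrative static rate table."""
    return Decimal(str(amount)) * USD_RATES[currency.upper()]

def standardize_order(raw: dict) -> dict:
    """Apply one consistent set of typing, timestamp, currency, and null rules
    to an order row regardless of which source produced it."""
    row = {k.lower().strip(): normalize_nulls(v) for k, v in raw.items()}
    return {
        "order_id": str(row["order_id"]),                       # required field; KeyError surfaces as a reject
        "ordered_at": to_utc(row["ordered_at"]),
        "amount_usd": to_usd(row["amount"], row.get("currency") or "USD"),
        "customer_email": (row.get("email") or "").lower() or None,
    }
```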
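For the idempotency requirement, one common pattern is a key-based MERGE from a staging table into the curated table, so a rerun or backfill overwrites existing rows instead of duplicating them. A minimal sketch using the Snowflake Python connector; the table and column names (STAGING.ORDERS_CLEAN, CURATED.ORDERS) are hypothetical:

```python
# Idempotent upsert sketch: rerunning the same batch updates matched keys
# rather than inserting them again. Table and column names are assumptions.
import snowflake.connector

MERGE_SQL = """
MERGE INTO CURATED.ORDERS AS tgt
USING STAGING.ORDERS_CLEAN AS src
  ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED THEN UPDATE SET
  tgt.ORDERED_AT  = src.ORDERED_AT,
  tgt.AMOUNT_USD  = src.AMOUNT_USD,
  tgt.SOURCE_FILE = src.SOURCE_FILE,
  tgt.LOADED_AT   = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT
  (ORDER_ID, ORDERED_AT, AMOUNT_USD, SOURCE_FILE, LOADED_AT)
VALUES
  (src.ORDER_ID, src.ORDERED_AT, src.AMOUNT_USD, src.SOURCE_FILE, CURRENT_TIMESTAMP())
"""

def run_idempotent_load(conn_params: dict) -> None:
    """Load the staged batch into the curated layer; safe to rerun."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        conn.cursor().execute(MERGE_SQL)
    finally:
        conn.close()
```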
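The monitoring requirement could be expressed as per-batch checks against recorded load metrics. A minimal sketch with illustrative thresholds: the 90-minute staleness window and the ±50% row-count tolerance are assumptions to be tuned, and only the 0.5% invalid-record ratio comes from the scale requirements above:

```python
# Per-batch quality and freshness checks; thresholds other than the 0.5%
# invalid-record target are illustrative defaults.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class BatchMetrics:
    source: str
    loaded_at: datetime      # when the batch landed in the curated layer
    row_count: int           # rows loaded in this batch
    invalid_count: int       # rows routed to the reject/quarantine path

def check_freshness(m: BatchMetrics, max_age: timedelta = timedelta(minutes=90)) -> bool:
    """Flag a source whose latest batch is older than the allowed staleness window."""
    return datetime.now(timezone.utc) - m.loaded_at <= max_age

def check_invalid_ratio(m: BatchMetrics, max_ratio: float = 0.005) -> bool:
    """Enforce the <0.5% invalid-record target per batch."""
    return m.row_count == 0 or m.invalid_count / m.row_count <= max_ratio

def check_row_count(m: BatchMetrics, trailing_avg: float, tolerance: float = 0.5) -> bool:
    """Flag batches deviating more than +/-50% from the trailing hourly average."""
    return abs(m.row_count - trailing_avg) <= tolerance * trailing_avg
```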
Constraints
- Team: 3 data engineers, no dedicated platform team
- Infrastructure: AWS already in use; Snowflake is the analytics warehouse
- Budget: Prefer managed services over large always-on clusters
- Compliance: PII includes email, phone, and billing address; access must be auditable
- Operational requirement: Analysts must be able to trace every curated record back to its raw source file (see the lineage sketch below).
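For the traceability requirement, one approach is to stamp every raw row with the object key of the file it came from at ingest time and carry that column through each layer to the curated tables. A minimal sketch assuming raw files land in S3; the bucket layout and the `_source_file` column name are illustrative:

```python
# Lineage sketch: every row keeps a pointer to the exact S3 object it came
# from, so curated records can be traced back to their raw source file.
import csv
import io

import boto3

def read_raw_with_lineage(bucket: str, key: str):
    """Yield each raw CSV row tagged with its originating S3 object key."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    for row in csv.DictReader(io.StringIO(body.decode("utf-8"))):
        row["_source_file"] = f"s3://{bucket}/{key}"  # carried through every layer
        yield row
```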