Context
Northstar Retail ingests customer profile data from its CRM (Salesforce), e-commerce platform (Shopify), and support system (Zendesk). Today, nightly batch jobs load each source independently into Snowflake, causing conflicting values for fields like email, phone, address, and loyalty status; analysts and downstream services do not know which source to trust.
You need to design a pipeline that consolidates these records into a single golden customer table while preserving lineage, handling source conflicts deterministically, and supporting both daily batch loads and near-real-time updates for critical fields.
Scale Requirements
- Sources: 3 primary systems, with 2 more planned within 6 months
- Volume: 45M customer records total, ~8M daily changed rows, ~2K CDC events/sec peak
- Latency: <15 minutes for CDC-driven updates, <2 hours for full daily reconciliation
- Storage: 12 TB raw historical data retained for 1 year; curated warehouse tables retained indefinitely
- Availability: 99.9% successful pipeline runs per month
Requirements
- Ingest batch extracts and CDC streams from all source systems into a raw layer with immutable history.
- Standardize schemas, normalize fields (email casing, phone formatting, address parsing), and assign a canonical customer key.
- Implement conflict resolution rules, such as source priority, recency, field-level trust scores, and survivorship logic.
- Preserve auditability: every golden record field must be traceable to its source record and resolution rule.
- Ensure idempotent reprocessing for retries, backfills, and late-arriving updates.
- Publish a curated
golden_customer table for analytics and operational consumers.
- Define monitoring, alerting, and failure recovery for data quality and freshness.
Constraints
- Existing stack is AWS + Snowflake; avoid introducing more than one new stateful platform.
- Team size is 3 data engineers; operational complexity must stay low.
- PII must be encrypted at rest and in transit; access must be role-based.
- Budget target is <$18K/month incremental infrastructure cost.