Context
Northstar Retail receives customer data from Salesforce, Shopify, Zendesk, and a legacy ERP. Today, nightly Airflow jobs land raw files in S3 and dbt models build tables in Snowflake, but analysts frequently see conflicting values for the same customer (email, address, loyalty tier, and marketing consent), which breaks downstream reporting and CRM syncs.
You need to design a pipeline that consolidates these sources into a single trusted customer dimension with deterministic conflict resolution, auditability, and reprocessing support.
Scale Requirements
- Sources: 4 primary systems, plus 1 manual CSV override feed from operations
- Volume: 45M customer records total, 8M daily changed rows, 250 GB/day raw ingest
- Batch cadence: Every 15 minutes for API sources; nightly full snapshot from ERP
- Latency target: Source update to curated table in Snowflake within 20 minutes
- Retention: 2 years raw history, 7 years audit trail for compliance
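As a rough sizing aid, the figures above can be turned into per-batch numbers. This is a back-of-the-envelope sketch only: it assumes changes and raw bytes are spread evenly across 15-minute batches (the nightly ERP snapshot will make real load burstier), uses decimal units (1 TB = 1000 GB), and ignores compression. The function and constant names are illustrative.

```python
# Back-of-the-envelope sizing from the stated scale requirements.
# Assumption: load is spread evenly across batches, which the nightly
# ERP full snapshot will violate in practice.

BATCH_MINUTES = 15
DAILY_CHANGED_ROWS = 8_000_000
DAILY_RAW_GB = 250
RAW_RETENTION_DAYS = 2 * 365  # 2 years of raw history


def batches_per_day(batch_minutes: int) -> int:
    """Number of fixed-size batch windows in a 24-hour day."""
    return (24 * 60) // batch_minutes


def capacity_estimate() -> dict:
    """Average per-batch load and uncompressed raw storage at full retention."""
    n = batches_per_day(BATCH_MINUTES)
    return {
        "batches_per_day": n,
        "rows_per_batch": DAILY_CHANGED_ROWS / n,
        "gb_per_batch": DAILY_RAW_GB / n,
        "raw_retention_tb": DAILY_RAW_GB * RAW_RETENTION_DAYS / 1000,
    }


if __name__ == "__main__":
    for name, value in capacity_estimate().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions each batch averages roughly 83K changed rows and about 2.6 GB of raw data, and two years of uncompressed raw history comes to about 182.5 TB.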
Requirements
- Ingest incremental extracts from APIs and files into a raw zone without losing source-specific fields.
- Standardize schemas and map records to a canonical customer model.
- Detect conflicting attributes across sources for the same customer_id or matched identity keys.
- Implement conflict resolution rules, such as source priority, latest-update wins, and field-level trust scores.
- Preserve lineage showing which source supplied each final field value and why it was selected.
- Support idempotent reruns, backfills, and replay of historical snapshots when rules change.
- Produce a curated dim_customer table and a customer_conflict_audit table for analysts and operations.
- Add data quality checks for null spikes, duplicate identities, invalid emails, and consent inconsistencies.
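The conflict resolution and lineage requirements can be illustrated with a small field-level resolver. This is a minimal sketch, not a prescribed design: the `SOURCE_PRIORITY` ranking, the `Candidate` and `Resolution` types, and the strategy names are all hypothetical, and a production version would more likely be expressed in dbt/SQL. The point it shows is that every tie is broken by a fixed key, so reruns give the same winner, and that the winning source and rule are recorded alongside the value.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical ranking (lower wins); the real order is a business decision.
SOURCE_PRIORITY = {"ops_override": 0, "salesforce": 1, "shopify": 2,
                   "zendesk": 3, "erp": 4}
UNKNOWN_SOURCE_RANK = 99


@dataclass(frozen=True)
class Candidate:
    """One source's value for a single field of a single customer."""
    source: str
    value: Optional[str]
    updated_at: datetime


@dataclass(frozen=True)
class Resolution:
    """Chosen value plus the lineage needed for an audit row."""
    field: str
    value: Optional[str]
    source: Optional[str]
    rule: str
    had_conflict: bool


def resolve_field(field: str, candidates: List[Candidate],
                  strategy: str = "source_priority") -> Resolution:
    """Pick one value deterministically and record why it was chosen."""
    present = [c for c in candidates if c.value is not None]
    if not present:
        return Resolution(field, None, None, "no_value", False)

    # A conflict exists when sources disagree on a non-null value.
    had_conflict = len({c.value for c in present}) > 1

    def rank(c: Candidate) -> int:
        return SOURCE_PRIORITY.get(c.source, UNKNOWN_SOURCE_RANK)

    if strategy == "source_priority":
        # Best-ranked source wins; ties fall back to recency, then name.
        winner = min(present, key=lambda c: (rank(c), -c.updated_at.timestamp(), c.source))
    elif strategy == "latest_update":
        # Most recent update wins; ties fall back to rank, then name.
        winner = min(present, key=lambda c: (-c.updated_at.timestamp(), rank(c), c.source))
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    return Resolution(field, winner.value, winner.source, strategy, had_conflict)


if __name__ == "__main__":
    candidates = [
        Candidate("shopify", "new@example.com", datetime(2024, 1, 2, tzinfo=timezone.utc)),
        Candidate("salesforce", "old@example.com", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ]
    print(resolve_field("email", candidates))
    print(resolve_field("email", candidates, strategy="latest_update"))
```

Each `Resolution` maps naturally to one row of customer_conflict_audit, while the winning values populate dim_customer. Field-level trust scores could be added as another strategy that ranks on a per-field, per-source score instead of a global priority.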
Constraints
- Existing stack is AWS, Snowflake, dbt, and Airflow; avoid introducing more than one major new platform.
- Budget increase is capped at $15K/month.
- PII must be encrypted at rest and masked in non-production environments.
- Business users need field-level explainability for every resolved record.
- The ERP source can arrive up to 6 hours late and occasionally republishes duplicate files.
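The ERP constraint suggests guarding ingestion with a content fingerprint, so that a republished file is recognized as a duplicate whatever its filename or arrival time. The sketch below is one possible approach under stated assumptions: `FileLedger` is a hypothetical in-memory stand-in for a durable ledger (for example, a Snowflake control table), and the 6-hour lateness window comes from the constraint above.

```python
import hashlib
from datetime import datetime, timedelta

# From the stated constraint: ERP files may arrive up to 6 hours late.
ERP_MAX_LATENESS = timedelta(hours=6)


def content_fingerprint(payload: bytes) -> str:
    """SHA-256 of the file bytes; identical content gives an identical key."""
    return hashlib.sha256(payload).hexdigest()


class FileLedger:
    """Hypothetical in-memory ledger of processed file fingerprints.

    A real pipeline would persist this so reruns and backfills see the
    same history.
    """

    def __init__(self) -> None:
        self._seen = set()

    def should_process(self, payload: bytes) -> bool:
        """Return True the first time a payload is seen, False for repeats."""
        fingerprint = content_fingerprint(payload)
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True


def is_within_lateness(expected_at: datetime, arrived_at: datetime) -> bool:
    """True if the file arrived no more than 6 hours after it was expected.

    Both datetimes must be either naive or timezone-aware; do not mix them.
    """
    return arrived_at - expected_at <= ERP_MAX_LATENESS


if __name__ == "__main__":
    ledger = FileLedger()
    snapshot = b"customer_id,email\n1,a@example.com\n"
    print(ledger.should_process(snapshot))  # first delivery
    print(ledger.should_process(snapshot))  # republished duplicate
```

Hashing content rather than filenames is a deliberate choice here: it makes the load step idempotent even when the ERP republishes the same snapshot under a new name. Files arriving outside the lateness window could be routed to a backfill path instead of the regular batch.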