Context
NimbusCRM manages a global contact database used by sales, marketing, and support. New records arrive continuously from Salesforce, HubSpot, web forms, CSV partner uploads, and internal product events, but the current nightly SQL jobs only standardize a few fields and allow duplicates, stale records, and conflicting updates to accumulate.
You are asked to design a pipeline that keeps the contact master table clean over time while supporting both near-real-time ingestion for operational systems and batch reconciliation for historical corrections.
Scale Requirements
- Sources: 12 upstream systems (4 streaming, 8 batch or file-based)
- Streaming throughput: 8K events/sec peak, 1.5K events/sec average
- Batch volume: 40M contact change records/day, plus weekly backfills up to 1.2B rows
- Master dataset: 220M unique contacts, ~3.5TB curated storage
- Latency target: < 3 minutes from source update to cleaned contact availability
- Retention: Raw immutable history for 2 years; curated master retained indefinitely
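As a rough sanity check on the figures above, the arithmetic below converts the stated volumes into per-second rates and per-contact storage. It is a back-of-the-envelope sketch, not a sizing recommendation. Two values are assumptions that do not appear in the brief: the 6-hour backfill window and the reading of "TB" as decimal terabytes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures taken from the scale requirements.
batch_records_per_day = 40_000_000
stream_avg_per_sec = 1_500
stream_peak_per_sec = 8_000
master_contacts = 220_000_000
backfill_rows = 1_200_000_000

# Assumptions (not stated in the brief).
curated_bytes = 3.5e12          # assumes decimal terabytes
backfill_window_hours = 6       # hypothetical window for a weekly backfill

# Daily batch volume spread evenly over a day: ~463 records/sec.
avg_batch_rate = batch_records_per_day / SECONDS_PER_DAY

# Streaming events per day at the average rate: 129,600,000.
stream_events_per_day = stream_avg_per_sec * SECONDS_PER_DAY

# Peak-to-average ratio the streaming path must absorb: ~5.3x.
peak_to_avg = stream_peak_per_sec / stream_avg_per_sec

# Curated storage per unique contact: ~15.9 KB.
bytes_per_contact = curated_bytes / master_contacts

# Sustained rate needed to finish a full backfill in the assumed window:
# ~55,556 rows/sec, roughly 7x the streaming peak.
backfill_rate = backfill_rows / (backfill_window_hours * 3600)

if __name__ == "__main__":
    print(f"batch avg:         {avg_batch_rate:,.0f} records/sec")
    print(f"stream per day:    {stream_events_per_day:,} events")
    print(f"peak/avg ratio:    {peak_to_avg:.1f}x")
    print(f"bytes per contact: {bytes_per_contact:,.0f}")
    print(f"backfill rate:     {backfill_rate:,.0f} rows/sec")
```

Under these assumptions the weekly backfill, not the streaming peak, sets the highest sustained throughput, which suggests sizing the batch path separately from the streaming path.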
Requirements
- Ingest contact create/update/delete events from APIs, CDC streams, and daily CSV drops.
- Standardize fields such as email, phone, address, country, and name casing.
- Detect duplicates across sources using deterministic and probabilistic matching rules.
- Merge records into a canonical contact profile with survivorship logic and source precedence.
- Preserve full lineage, change history, and replay capability for backfills and rule changes.
- Ensure idempotent processing for retried files, duplicate events, and late-arriving updates.
- Expose cleaned data to Snowflake for analytics and to a low-latency serving table for downstream applications.
- Define monitoring, alerting, and operational recovery procedures.
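To make the standardization, deterministic matching, survivorship, and idempotency requirements above concrete, here is a minimal sketch of one possible approach. Everything named in it is hypothetical: the `ContactEvent` shape, the `SOURCE_PRECEDENCE` ranking, and the choice of normalized email as the match key are illustrative assumptions, not part of the brief. The phone normalizer is deliberately naive, and probabilistic matching is not covered.

```python
import hashlib
import re
import unicodedata
from dataclasses import dataclass


def normalize_email(raw: str) -> str:
    """Unicode-normalize, trim, and lowercase an email address."""
    return unicodedata.normalize("NFKC", raw).strip().lower()


def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Reduce a phone number to '+<digits>'.

    Naive: numbers without a leading '+' are assumed to belong to
    default_country_code; trunk prefixes and extensions are not handled.
    """
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country_code + digits


def match_key(email: str) -> str:
    """Deterministic match key: SHA-256 of the normalized email."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()


# Hypothetical precedence: a higher number wins; unknown sources rank 0.
SOURCE_PRECEDENCE = {"salesforce": 3, "hubspot": 2, "web_form": 1}


@dataclass
class ContactEvent:
    event_id: str     # assumed unique per source event, used for dedup
    source: str
    updated_at: int   # epoch seconds
    fields: dict      # already-normalized field values


def merge(events):
    """Field-level survivorship over events for one matched contact.

    Duplicate event_ids are skipped, and each field keeps the value with
    the highest (source precedence, updated_at, event_id) rank, so the
    result does not depend on arrival order or on retries.
    """
    seen = set()
    best = {}  # field name -> (rank, value)
    for ev in events:
        if ev.event_id in seen:
            continue  # retried or duplicated event
        seen.add(ev.event_id)
        rank = (SOURCE_PRECEDENCE.get(ev.source, 0), ev.updated_at, ev.event_id)
        for name, value in ev.fields.items():
            if value in (None, ""):
                continue  # empty values never overwrite real data
            if name not in best or rank > best[name][0]:
                best[name] = (rank, value)
    return {name: value for name, (_, value) in best.items()}
```

Because the merge is a pure function of the set of events, replaying the same raw history after a rule change yields the same canonical profile, which is one way to satisfy the replay and idempotency requirements together.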
Constraints
- Infrastructure must stay on AWS and use managed services where possible.
- Team size is 5 data engineers; operational complexity should be moderate.
- Budget target is <$35K/month incremental cloud cost.
- Compliance requirements include GDPR/CCPA deletion support and auditability of merge decisions.
- Some source systems provide unreliable primary keys and inconsistent schemas.
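The deletion constraint interacts with the replay requirement: replaying raw history after an erasure request must not resurrect the erased contact. One possible pattern, sketched below, is a suppression list of keyed hashes that is consulted during replay. The names `SuppressionList`, `erasure_token`, and `replay`, and the event shape with an `email` key, are hypothetical. Whether retaining a keyed hash is acceptable under GDPR/CCPA is a compliance decision this sketch does not settle.

```python
import hashlib
import hmac


def erasure_token(email: str, secret: bytes) -> str:
    """Keyed hash of the normalized email, so the list stores no plaintext."""
    normalized = email.strip().lower().encode("utf-8")
    return hmac.new(secret, normalized, hashlib.sha256).hexdigest()


class SuppressionList:
    """In-memory stand-in for a durable store of erased-contact tokens."""

    def __init__(self, secret: bytes):
        self._secret = secret
        self._tokens = set()

    def record_erasure(self, email: str) -> None:
        self._tokens.add(erasure_token(email, self._secret))

    def is_erased(self, email: str) -> bool:
        return erasure_token(email, self._secret) in self._tokens


def replay(raw_events, suppression: SuppressionList):
    """Return only the events whose contact has not been erased."""
    return [ev for ev in raw_events if not suppression.is_erased(ev["email"])]


if __name__ == "__main__":
    suppression = SuppressionList(secret=b"demo-only-secret")
    suppression.record_erasure("Ada@Example.com")
    history = [
        {"event_id": "e1", "email": "ada@example.com"},
        {"event_id": "e2", "email": "grace@example.com"},
    ]
    print([ev["event_id"] for ev in replay(history, suppression)])
```

The same filter can run at ingestion time so that late-arriving updates for an erased contact are dropped before they reach the master table.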