Context
LedgerLoop, a B2B SaaS billing platform, ingests customer and subscription data from Salesforce, Stripe, and an internal PostgreSQL application database. Today, analysts manually reconcile conflicting values such as email, billing status, plan tier, and account owner, causing inconsistent reporting and downstream billing errors.
You need to design a pipeline that consolidates these sources into a trusted customer master table while preserving lineage, handling late-arriving updates, and making conflict resolution rules transparent and auditable.
Scale Requirements
- Sources: Salesforce CDC, Stripe API extracts, PostgreSQL WAL-based CDC
- Volume: 25M customer records, 120M subscription events/day
- Update rate: 8K record changes/second peak, 1.5K average
- Latency target: < 10 minutes from source update to warehouse availability
- Retention: Raw history for 1 year, curated master tables for 7 years
- Storage: ~12 TB raw/year, ~4 TB curated/year
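The figures above can be turned into rough per-second and per-batch numbers. The sketch below is illustrative only: it assumes the 5-minute micro-batch interval mentioned under Constraints and decimal units (1 TB = 1000 GB). The variable names are not part of the brief.

```python
# Back-of-the-envelope sizing from the stated scale figures.
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 120_000_000    # subscription events/day (from the brief)
peak_changes_per_sec = 8_000    # peak record changes/second (from the brief)
avg_changes_per_sec = 1_500     # average record changes/second (from the brief)
raw_tb_per_year = 12            # approximate raw storage per year (from the brief)
batch_interval_sec = 5 * 60     # assumption: 5-minute micro-batches

# Average event rate implied by the daily volume.
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# Rows a single micro-batch must absorb at peak and on average.
rows_per_batch_peak = peak_changes_per_sec * batch_interval_sec
rows_per_batch_avg = avg_changes_per_sec * batch_interval_sec

# Approximate raw landing volume per day (decimal TB -> GB).
raw_gb_per_day = raw_tb_per_year * 1000 / 365

print(f"average subscription events/sec: {avg_events_per_sec:,.0f}")
print(f"rows per 5-minute batch at peak: {rows_per_batch_peak:,}")
print(f"rows per 5-minute batch on average: {rows_per_batch_avg:,}")
print(f"raw data per day: {raw_gb_per_day:.1f} GB")
```

Under these assumptions the daily volume works out to roughly 1,389 events/second on average, a peak micro-batch holds about 2.4 million changed rows, and raw storage grows by about 33 GB/day.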
Requirements
- Ingest data incrementally from all three systems without full reloads.
- Standardize schemas and map source-specific fields into a canonical customer model.
- Resolve conflicting values using deterministic business rules, such as source priority, recency, and field-level trust scores.
- Preserve full change history, source lineage, and the winning rule used for each resolved field.
- Support idempotent reprocessing, backfills, and replay of historical source changes.
- Publish both raw bronze tables and a reconciled gold customer_master table for analytics and operational reporting.
- Add automated data quality checks for null keys, duplicate customer IDs, invalid statuses, and abnormal conflict rates.
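One way to make the conflict-resolution requirements concrete is a small, deterministic resolver that records which rule produced each winning value. The sketch below is a minimal illustration, not a prescribed design: the priority tables, class names, and rule labels are all hypothetical, it assumes every source change carries a globally unique event ID, and it covers only source priority and recency (field-level trust scores are left out).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical per-field source priority (lower number wins). These rankings
# are illustrative assumptions, not rules stated in the brief.
FIELD_SOURCE_PRIORITY = {
    "billing_status": {"stripe": 0, "postgres": 1, "salesforce": 2},
    "account_owner": {"salesforce": 0, "postgres": 1, "stripe": 2},
}
DEFAULT_PRIORITY = {"postgres": 0, "salesforce": 1, "stripe": 2}


@dataclass(frozen=True)
class FieldObservation:
    customer_id: str
    field: str
    value: Optional[str]
    source: str
    updated_at: datetime
    event_id: str  # assumed globally unique per source change


@dataclass(frozen=True)
class ResolvedField:
    customer_id: str
    field: str
    value: Optional[str]
    winning_source: str
    winning_event_id: str
    winning_rule: str  # lineage: which rule picked the winner


def resolve_field(observations):
    """Pick one winner among observations of a single customer field."""
    # Deduplicate by event_id so replayed changes cannot alter the outcome.
    unique = {obs.event_id: obs for obs in observations}.values()
    candidates = [obs for obs in unique if obs.value is not None]
    if not candidates:
        raise ValueError("no non-null observations to resolve")

    priority = FIELD_SOURCE_PRIORITY.get(candidates[0].field, DEFAULT_PRIORITY)
    best_rank = min(priority[obs.source] for obs in candidates)
    top = [obs for obs in candidates if priority[obs.source] == best_rank]
    # Within the preferred source the latest change wins; event_id breaks
    # exact timestamp ties so the result does not depend on input order.
    winner = max(top, key=lambda obs: (obs.updated_at, obs.event_id))

    sources = {obs.source for obs in candidates}
    if len(candidates) == 1:
        rule = "only_value"
    elif len(sources) == 1:
        rule = "recency"
    elif len(top) == 1:
        rule = "source_priority"
    else:
        rule = "source_priority_then_recency"

    return ResolvedField(winner.customer_id, winner.field, winner.value,
                         winner.source, winner.event_id, rule)


if __name__ == "__main__":
    def at(hour):
        return datetime(2024, 1, 1, hour, tzinfo=timezone.utc)

    demo = [
        FieldObservation("c1", "billing_status", "active", "salesforce", at(12), "sf-1"),
        FieldObservation("c1", "billing_status", "past_due", "stripe", at(9), "st-1"),
        FieldObservation("c1", "billing_status", "canceled", "stripe", at(10), "st-2"),
    ]
    print(resolve_field(demo))
```

Because the resolver deduplicates on event ID and breaks ties deterministically, feeding it the same changes twice or in a different order yields the same result, which is the property that idempotent reprocessing and replay depend on. Storing winning_source, winning_event_id, and winning_rule alongside each value is one possible way to satisfy the lineage requirement.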
Constraints
- Existing stack is AWS + Snowflake; avoid introducing more than one new major platform.
- Team size is 3 data engineers; operational complexity should stay moderate.
- PII must be encrypted at rest and access-controlled; audit logs are required.
- Budget allows one medium-sized streaming cluster or a batch-first design with micro-batches every 5 minutes.
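To judge whether the batch-first option can meet the 10-minute latency target, it helps to bound the worst case: a change that lands just after a batch cutoff waits a full interval and then the whole batch runtime. The sketch below is a rough model under that assumption. The function names and the 60-second capture lag are hypothetical, not figures from the brief.

```python
def worst_case_latency_sec(batch_interval_sec, batch_runtime_sec, capture_lag_sec=0.0):
    """Upper bound on source-update-to-warehouse latency for fixed-interval batches.

    A change captured just after a cutoff waits one full interval, then the
    batch runtime, on top of any lag in capturing it from the source.
    """
    return capture_lag_sec + batch_interval_sec + batch_runtime_sec


def max_batch_runtime_sec(latency_target_sec, batch_interval_sec, capture_lag_sec=0.0):
    """Longest batch runtime that still keeps the worst case within the target."""
    return latency_target_sec - batch_interval_sec - capture_lag_sec


LATENCY_TARGET_SEC = 10 * 60   # from the brief: < 10 minutes
BATCH_INTERVAL_SEC = 5 * 60    # from the brief: micro-batches every 5 minutes
CAPTURE_LAG_SEC = 60           # assumption: CDC/API extract delay, illustrative only

budget = max_batch_runtime_sec(LATENCY_TARGET_SEC, BATCH_INTERVAL_SEC, CAPTURE_LAG_SEC)
print(f"batch runtime budget: {budget} seconds")
```

With these assumed numbers, each micro-batch has about 4 minutes to transform and load its rows, so the 5-minute batch-first design leaves limited headroom for retries or slow warehouse loads.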