Context
Acme Retail syncs order, customer, and inventory data between its PostgreSQL transactional database, Salesforce, and Snowflake for analytics and downstream operations. Today, the company relies on ad hoc cron jobs and manual CSV transfers, causing stale data, duplicate records, and inconsistent business metrics across systems.
You are asked to design a pipeline architecture that moves data between these systems with reliable ingestion, transformation, validation, and delivery.
Scale Requirements
- Sources: PostgreSQL OLTP, Salesforce API, internal order events
- Throughput: 15K change events/minute peak, 2K/minute average
- Batch volume: 120M order rows, 25M customer rows, 3 years of history
- Latency target: operational syncs < 2 minutes, analytics availability < 10 minutes
- Storage: ~4 TB raw historical data, growing 300 GB/month
- Availability target: 99.9% of daily pipeline runs complete successfully
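As a starting point for sizing, the scale figures above can be turned into per-second and per-day numbers. The sketch below is back-of-the-envelope arithmetic only. It assumes linear storage growth, 1 TB = 1000 GB, and order rows spread evenly across the 3 years of history. None of these assumptions are stated in the requirements, and the variable and function names are illustrative.

```python
# Back-of-the-envelope sizing derived from the stated scale requirements.
PEAK_EVENTS_PER_MIN = 15_000
AVG_EVENTS_PER_MIN = 2_000
ORDER_ROWS = 120_000_000
HISTORY_DAYS = 3 * 365            # 3 years of history, ignoring leap days
RAW_STORAGE_GB = 4_000            # ~4 TB, assuming 1 TB = 1000 GB
GROWTH_GB_PER_MONTH = 300

peak_events_per_sec = PEAK_EVENTS_PER_MIN / 60                 # 250.0
avg_events_per_day = AVG_EVENTS_PER_MIN * 60 * 24              # 2_880_000
peak_to_avg_ratio = PEAK_EVENTS_PER_MIN / AVG_EVENTS_PER_MIN   # 7.5


def projected_storage_gb(months: int) -> int:
    """Raw storage after `months`, assuming growth stays at 300 GB/month."""
    return RAW_STORAGE_GB + GROWTH_GB_PER_MONTH * months


# Rough row count for a 90-day backfill, ASSUMING orders are spread evenly
# over the history window (real volume is likely skewed toward recent months).
backfill_rows_estimate = ORDER_ROWS * 90 // HISTORY_DAYS

if __name__ == "__main__":
    print(f"peak events/sec:        {peak_events_per_sec}")
    print(f"avg events/day:         {avg_events_per_day}")
    print(f"peak-to-average ratio:  {peak_to_avg_ratio}")
    print(f"storage in 12 months:   {projected_storage_gb(12)} GB")
    print(f"90-day backfill rows:   ~{backfill_rows_estimate}")
```

The 7.5x gap between peak and average throughput suggests sizing the streaming path for bursts rather than for the mean, and the backfill estimate (roughly 10M order rows under the even-spread assumption) gives a first sense of how much load a 90-day correction adds on top of live traffic.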
Requirements
- Ingest incremental changes from PostgreSQL and Salesforce without full reloads.
- Support both near-real-time operational syncs and scheduled ELT into Snowflake.
- Enforce schema validation, deduplication, and idempotent writes across retries.
- Track lineage and job dependencies so downstream tables are updated in the correct order.
- Handle late-arriving or out-of-order events for order status updates.
- Provide observability for freshness, volume anomalies, and failed loads.
- Support backfills for a 90-day historical correction without disrupting live traffic.
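The deduplication, idempotent-write, and out-of-order requirements above interact, so a small sketch may help make them concrete. The example below is one possible approach, not a prescribed design: it keeps state in memory, whereas a real pipeline would persist it in the target store. The names `OrderEvent`, `OrderStatusSink`, `event_id`, and `event_time` are hypothetical. It assumes every event carries a unique ID and a source-side timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class OrderEvent:
    event_id: str         # assumed unique per change event
    order_id: str
    status: str
    event_time: datetime  # assumed to be set by the source system


class OrderStatusSink:
    """In-memory stand-in for a target table keyed by order_id."""

    def __init__(self) -> None:
        self._seen_event_ids: Set[str] = set()
        self._latest: Dict[str, OrderEvent] = {}

    def apply(self, event: OrderEvent) -> bool:
        """Apply an event; return True only if the stored status changed."""
        # Deduplication: a retried delivery of the same event is a no-op.
        if event.event_id in self._seen_event_ids:
            return False
        self._seen_event_ids.add(event.event_id)

        # Out-of-order handling: an event older than (or as old as) the one
        # already stored must not overwrite the newer status.
        current = self._latest.get(event.order_id)
        if current is not None and event.event_time <= current.event_time:
            return False

        self._latest[event.order_id] = event
        return True

    def status_of(self, order_id: str) -> Optional[str]:
        event = self._latest.get(order_id)
        return event.status if event else None


if __name__ == "__main__":
    t1 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    t2 = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
    sink = OrderStatusSink()
    shipped = OrderEvent("e2", "o1", "SHIPPED", t2)
    placed = OrderEvent("e1", "o1", "PLACED", t1)
    sink.apply(shipped)   # newest event arrives first
    sink.apply(placed)    # late arrival, ignored
    sink.apply(shipped)   # retry of the same event, ignored
    print(sink.status_of("o1"))
```

Because applying the same event twice leaves the state unchanged, the writer can be retried safely after a failure. In a warehouse, the same rule could be expressed as a merge keyed on `order_id` that updates only when the incoming timestamp is newer. Events with identical timestamps are dropped here, which is a simplification; a sequence number would be a more robust tie-breaker.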
Constraints
- Infrastructure must run on AWS using managed services where practical.
- Team size is 3 data engineers; operational complexity should stay moderate.
- Budget for incremental infrastructure is capped at $18K/month.
- Customer PII must be encrypted in transit and at rest; access must be auditable.
- Salesforce API has rate limits and occasional partial failures.
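The rate-limit constraint above is commonly handled with bounded retries and exponential backoff. The sketch below shows that pattern with full jitter. It is generic and does not use any real Salesforce client: `RateLimitedError` and `call_with_backoff` are hypothetical names, and the wrapped callable stands in for whatever API call the pipeline makes. Partial failures would additionally need per-record result handling, which is not shown here.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error raised by the caller's API wrapper on a rate limit."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on RateLimitedError with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimitedError:
            if attempt == max_attempts:
                raise  # retry budget exhausted; surface the failure
            # The delay ceiling doubles each attempt and is capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: sleep a random amount up to the ceiling so that
            # parallel workers do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_fetch() -> str:
        # Simulated API call that is rate limited on the first two attempts.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitedError("simulated rate limit")
        return "page-1"

    print(call_with_backoff(flaky_fetch, base_delay=0.01))
```

Retries like this are only safe when the downstream write is idempotent, which is why this constraint pairs with the idempotent-write requirement. The `sleep` parameter is injectable so the behavior can be tested without real waiting.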
Your answer should describe the end-to-end design, data contracts, orchestration strategy, quality controls, and how you would guarantee consistent delivery between systems under retries and failures.
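To illustrate what a data contract could look like in an answer, the sketch below checks a record against a declared set of fields and types before it is loaded. It is a minimal illustration only: the field names in `ORDER_EVENT_CONTRACT` are assumptions, not taken from any actual Acme Retail schema, and a real contract would likely also cover nullability, allowed values, and versioning.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List

# Hypothetical contract for an order status event: field name -> required type.
ORDER_EVENT_CONTRACT: Dict[str, type] = {
    "event_id": str,
    "order_id": str,
    "status": str,
    "event_time": datetime,
}


def validate(record: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return a list of contract violations; an empty list means the record passes."""
    errors: List[str] = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Fields outside the contract are flagged so schema drift is noticed early.
    for field in sorted(record.keys() - contract.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


if __name__ == "__main__":
    bad_record = {
        "event_id": "e1",
        "order_id": 42,  # wrong type
        "event_time": datetime(2024, 1, 1, tzinfo=timezone.utc),
        "coupon": "X",   # not part of the contract
    }
    for problem in validate(bad_record, ORDER_EVENT_CONTRACT):
        print(problem)
```

Records that fail validation could be routed to a quarantine location rather than dropped, so that they remain available for inspection and replay.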