Context
AcmeCloud, a B2B SaaS company, needs a unified customer data pipeline across Salesforce (CRM), Stripe (billing), and Zendesk (support). Today, each system is queried independently and analysts manually reconcile account IDs, causing inconsistent revenue, churn, and support-health reporting.
You need to design a production-grade pipeline that integrates these disparate systems into a single analytics model in Snowflake. The company wants both scheduled batch ingestion for full-system consistency and near-real-time updates for key customer events.
Scale Requirements
- Sources: Salesforce (~8M accounts/contacts records), Stripe (~120M charge/invoice/payment events), Zendesk (~45M tickets/comments)
- Change volume: ~2M source updates/day, with 150 events/sec peak from webhooks
- Latency targets: <15 minutes for webhook-driven updates, <4 hours for daily full reconciliation
- Storage: ~12 TB raw historical data, growing 400 GB/month
- Consumers: 60 BI users, 8 downstream data products, 3 finance dashboards
Requirements
- Ingest data from REST APIs, bulk exports, and webhooks from all three systems.
- Normalize disparate schemas and map records to a canonical
customer_id despite missing or conflicting identifiers.
- Support both incremental loads and backfills without creating duplicates.
- Build analytics-ready tables for customer 360, MRR, failed payments, open support escalations, and churn risk inputs.
- Implement data quality checks for schema drift, null keys, duplicate events, and referential integrity.
- Orchestrate dependencies so raw ingestion, staging transforms, and marts run reliably and can be replayed.
- Expose lineage, freshness, and load status for stakeholders.
Constraints
- AWS is the required cloud; Snowflake is already the warehouse.
- Team size is 3 data engineers; operational complexity should stay moderate.
- Budget for new infrastructure is limited to $18K/month.
- Compliance: PII must be encrypted at rest, and deletion requests must propagate within 72 hours.