You are inheriting a reporting pipeline for a B2B SaaS platform that has expanded across AWS, Azure, and GCP. Product, finance, and support teams rely on shared dashboards, but monthly cloud billing exports, application events, and CRM data arrive in different formats and on different schedules, causing frequent mismatches in executive reports. Leadership wants a simpler cloud-native pipeline centered on Snowflake with clearer ownership, faster onboarding of new sources, and stronger data quality controls. The immediate trigger is an escalation after finance closed the quarter with revenue and usage numbers that did not reconcile.
| Component | Status / Technology |
|---|---|
| Source systems | AWS CUR in S3, Azure cost exports in Blob Storage, GCP billing exports in BigQuery, CRM via REST API |
| Ingestion | Python scripts on cron, ad hoc file drops, manual retries |
| Processing | Mixed pandas jobs and SQL scripts |
| Storage | Snowflake with loosely governed raw and reporting tables |
| Orchestration | Basic Apache Airflow 2.x DAGs |
Scale: ~2 TB/day landed data, 40+ source feeds, 15-minute freshness target for usage data, daily close-critical finance datasets, 3 years of retained history.
How would you redesign this pipeline around Snowflake-native services so cloud data from multiple providers is ingested, transformed, and reconciled consistently, while keeping the system easy to operate and resilient to late files, schema drift, and backfills?