Modernize Multi-Cloud ELT Reporting Pipeline

Scenario

You are inheriting a reporting pipeline for a B2B SaaS platform that has expanded across AWS, Azure, and GCP. Product, finance, and support teams rely on shared dashboards, but monthly cloud billing exports, application events, and CRM data arrive in different formats and on different schedules, causing frequent mismatches in executive reports. Leadership wants a simpler cloud-native pipeline centered on Snowflake with clearer ownership, faster onboarding of new sources, and stronger data quality controls. The immediate trigger is an escalation after finance closed the quarter with revenue and usage numbers that did not reconcile.

Current State

Component	Status / Technology
Source systems	AWS CUR in S3, Azure cost exports in Blob Storage, GCP billing exports in BigQuery, CRM via REST API
Ingestion	Python scripts on cron, ad hoc file drops, manual retries
Processing	Mixed pandas jobs and SQL scripts
Storage	Snowflake with loosely governed raw and reporting tables
Orchestration	Basic Apache Airflow 2.x DAGs

Scale: ~2 TB/day landed data, 40+ source feeds, 15-minute freshness target for usage data, daily close-critical finance datasets, 3 years of retained history.

Question

How would you redesign this pipeline around Snowflake-native services so cloud data from multiple providers is ingested, transformed, and reconciled consistently, while keeping the system easy to operate and resilient to late files, schema drift, and backfills?

Scenario

Current State

Component	Status / Technology
Source systems	AWS CUR in S3, Azure cost exports in Blob Storage, GCP billing exports in BigQuery, CRM via REST API
Ingestion	Python scripts on cron, ad hoc file drops, manual retries
Processing	Mixed pandas jobs and SQL scripts
Storage	Snowflake with loosely governed raw and reporting tables
Orchestration	Basic Apache Airflow 2.x DAGs

Scale: ~2 TB/day landed data, 40+ source feeds, 15-minute freshness target for usage data, daily close-critical finance datasets, 3 years of retained history.

Scenario

Current State

Component	Status / Technology
Source systems	AWS CUR in S3, Azure cost exports in Blob Storage, GCP billing exports in BigQuery, CRM via REST API
Ingestion	Python scripts on cron, ad hoc file drops, manual retries
Processing	Mixed pandas jobs and SQL scripts
Storage	Snowflake with loosely governed raw and reporting tables
Orchestration	Basic Apache Airflow 2.x DAGs

Scale: ~2 TB/day landed data, 40+ source feeds, 15-minute freshness target for usage data, daily close-critical finance datasets, 3 years of retained history.

Scenario

Current State

Component	Status / Technology
Source systems	AWS CUR in S3, Azure cost exports in Blob Storage, GCP billing exports in BigQuery, CRM via REST API
Ingestion	Python scripts on cron, ad hoc file drops, manual retries
Processing	Mixed pandas jobs and SQL scripts
Storage	Snowflake with loosely governed raw and reporting tables
Orchestration	Basic Apache Airflow 2.x DAGs

Scale: ~2 TB/day landed data, 40+ source feeds, 15-minute freshness target for usage data, daily close-critical finance datasets, 3 years of retained history.

Interview Guides

Scenario

Current State

Question

Modernize Multi-Cloud ELT Reporting Pipeline

Scenario

Current State

Question

Your Answer

Modernize Multi-Cloud ELT Reporting Pipeline

Scenario

Current State

Question

Modernize Multi-Cloud ELT Reporting Pipeline

Scenario

Current State

Question

Your Answer