Context
FinEdge, a mid-sized fintech company, currently runs on-premises SQL Server Integration Services (SSIS) jobs that move transactional, customer, and billing data into a reporting database once per day. The company is migrating to the cloud and wants a modern pipeline on either AWS or Azure that improves reliability, supports incremental processing, and reduces the current 10-hour reporting delay.
You are asked to design a cloud-native batch ETL platform that ingests data from operational databases, object storage, and SaaS APIs into an analytics warehouse.
Scale Requirements
- Sources: 12 PostgreSQL/MySQL databases, 4 SaaS APIs, and daily CSV drops
- Volume: 2.5 TB/day of raw data; ~18 billion rows of historical backfill
- Batch frequency: Hourly incremental loads for critical tables; daily full loads for low-priority sources
- Latency target: Critical datasets queryable within 30 minutes of source update
- Retention: 1 year raw zone, 5 years curated warehouse data
Requirements
- Design a cloud architecture on AWS or Azure for ingesting relational, file-based, and API data sources.
- Support both historical backfills and incremental loads using CDC timestamps or change tracking.
- Implement transformation layers for raw, cleaned, and curated datasets.
- Orchestrate dependencies between ingestion, validation, transformation, and warehouse loading jobs.
- Define how you would enforce idempotency, schema evolution handling, and data quality checks.
- Provide monitoring for job failures, SLA breaches, row-count anomalies, and cost spikes.
- Explain how analytics users will access trusted tables in the warehouse.
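To illustrate the incremental-load requirement, here is a minimal high-watermark extraction sketch in Python. The table and column names (`transactions`, `updated_at`) are hypothetical, and SQLite stands in for a source database; a real pipeline would persist the watermark in durable pipeline state between runs.

```python
import sqlite3

def extract_incremental(conn, table, watermark):
    """Pull only rows modified since the last successful run (high-watermark CDC).

    Assumes the source table has an indexed `updated_at` timestamp column.
    The watermark should be persisted after each successful load.
    """
    rows = conn.execute(
        f"SELECT id, amount, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # New watermark = max timestamp seen; unchanged if no new rows arrived.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# --- demo against an in-memory SQLite stand-in for a source database ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"),
     (2, 20.0, "2024-01-02T00:00:00"),
     (3, 30.0, "2024-01-03T00:00:00")],
)
rows, wm = extract_incremental(conn, "transactions", "2024-01-01T12:00:00")
print(len(rows), wm)  # only rows newer than the watermark are pulled
```

Re-running the extraction with the returned watermark pulls nothing, which is what makes hourly scheduling cheap relative to full reloads.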
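One common way to satisfy the idempotency requirement is to key every warehouse load on the primary key and upsert, so a replayed batch is a no-op rather than a duplicate. The sketch below uses SQLite's `INSERT ... ON CONFLICT` as a stand-in; warehouse-native equivalents are `MERGE` (Redshift, Synapse) or PostgreSQL's `ON CONFLICT`. The table name is illustrative.

```python
import sqlite3

def idempotent_load(conn, batch):
    """Upsert a batch keyed on the primary key so replays don't duplicate rows."""
    conn.executemany(
        "INSERT INTO curated_transactions (id, amount) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        batch,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE curated_transactions (id INTEGER PRIMARY KEY, amount REAL)")
batch = [(1, 10.0), (2, 20.0)]
idempotent_load(conn, batch)
idempotent_load(conn, batch)  # replaying the same batch leaves the table unchanged
count = conn.execute("SELECT COUNT(*) FROM curated_transactions").fetchone()[0]
print(count)
```

Because the load is a pure upsert, a failed orchestrator run can simply be retried end to end without a manual cleanup step.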
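For the row-count anomaly check named in the monitoring requirement, a simple baseline comparison often suffices before reaching for anything heavier. This sketch flags a load whose row count deviates from the trailing mean by more than a tolerance; the 50% threshold and sample counts are illustrative, not recommendations.

```python
def row_count_anomaly(history, today, tolerance=0.5):
    """Flag a load whose row count deviates more than `tolerance`
    (as a fraction) from the mean of recent daily counts."""
    baseline = sum(history) / len(history)
    deviation = abs(today - baseline) / baseline
    return deviation > tolerance

print(row_count_anomaly([1000, 1100, 950], 1020))  # within normal range
print(row_count_anomaly([1000, 1100, 950], 90))    # suspicious drop: alert
```

In practice the orchestrator would run this check as a validation task after each load and raise the alert through whatever notification channel the platform provides.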
Constraints
- Existing team has stronger SQL/Python skills than Spark expertise.
- Budget target is <$18K/month incremental cloud spend.
- PII data must be encrypted at rest and in transit.
- The design should prefer managed services over self-hosted infrastructure where possible.
- The company may standardize on either AWS or Azure, so your design should justify the platform choice clearly.