Context
DataCorp, a financial services provider, aggregates data from transactional databases, third-party APIs, and CSV files supplied by partners. The current integration process is manual and error-prone, causing delays and inconsistencies in reporting. The goal is to automate the ETL process so the business intelligence team gets timely, accurate analytics.
Current Architecture
| Component | Technology | Issue |
|---|---|---|
| Data Sources | MySQL, REST APIs, CSV files | Manual extraction and transformation |
| Ingestion | Custom scripts (Python) | High maintenance overhead |
| Processing | Batch jobs (Airflow) | Long processing times and lack of data validation |
| Storage | Snowflake | Data inconsistencies and outdated information |
Scale Requirements
- Data Volume: 10 million MySQL records, 1 million API records, and 500 GB of CSV data per month.
- Latency Target: Data should be available in Snowflake within 2 hours of source updates.
- Retention: Raw data for 30 days, aggregated data indefinitely.
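Assuming the monthly volumes above are spread roughly evenly over a 30-day month, a back-of-envelope calculation (the per-day and per-run figures below are derived estimates, not from the brief) suggests the daily loads are modest and the 2-hour latency target can be met with intra-day incremental batches rather than streaming:

```python
# Back-of-envelope sizing from the stated monthly volumes.
# Assumption: an even ~30-day distribution; real load may be bursty.
MYSQL_RECORDS_PER_MONTH = 10_000_000
API_RECORDS_PER_MONTH = 1_000_000
CSV_GB_PER_MONTH = 500
DAYS_PER_MONTH = 30

mysql_per_day = MYSQL_RECORDS_PER_MONTH // DAYS_PER_MONTH  # ~333K records/day
api_per_day = API_RECORDS_PER_MONTH // DAYS_PER_MONTH      # ~33K records/day
csv_gb_per_day = CSV_GB_PER_MONTH / DAYS_PER_MONTH         # ~16.7 GB/day

# A 2-hour freshness target implies runs at least every 2 hours; with hourly
# runs, each MySQL extract handles only a small incremental slice:
mysql_per_hourly_run = mysql_per_day // 24                 # ~13.9K records/run

print(mysql_per_day, api_per_day, round(csv_gb_per_day, 1), mysql_per_hourly_run)
```

These volumes are well within what scheduled Airflow batch jobs can handle, which matters given the team's limited streaming experience.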
Requirements
- Design an ETL pipeline that extracts data from MySQL, APIs, and CSV sources.
- Implement data transformation rules to clean and normalize data.
- Enforce data quality checks (e.g., schema validation, duplicate removal).
- Load processed data into Snowflake in a structured format.
- Set up orchestration using Apache Airflow for scheduling and monitoring.
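The quality checks called out above (schema validation, duplicate removal) can be sketched in plain Python as a pre-load transformation step. The field names, types, and the `record_id` dedup key below are illustrative assumptions, not DataCorp's actual schema:

```python
# Minimal sketch of the pre-load quality checks: schema validation + dedup.
# SCHEMA and the record_id key are assumptions for illustration only.
SCHEMA = {"record_id": str, "amount": float, "currency": str}

def validate(record: dict) -> bool:
    """True if the record has every required field with the expected type."""
    return all(
        field in record and isinstance(record[field], ftype)
        for field, ftype in SCHEMA.items()
    )

def clean(records: list[dict]) -> list[dict]:
    """Drop schema-invalid records, then deduplicate on record_id (first wins)."""
    seen, out = set(), []
    for rec in records:
        if not validate(rec):
            continue  # in production: route to a quarantine table for review
        if rec["record_id"] in seen:
            continue
        seen.add(rec["record_id"])
        out.append(rec)
    return out

raw = [
    {"record_id": "a1", "amount": 10.0, "currency": "USD"},
    {"record_id": "a1", "amount": 10.0, "currency": "USD"},  # duplicate
    {"record_id": "b2", "amount": "bad", "currency": "EUR"},  # wrong type
]
print(clean(raw))  # only the first a1 record survives
```

In the pipeline itself, this logic would run as a transformation task between extraction and the Snowflake load, so invalid rows never reach the warehouse.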
Constraints
- Team: 3 data engineers with limited experience in streaming technologies.
- Infrastructure: AWS environment with existing Snowflake setup.
- Budget: $15K/month for cloud resources and tools.