Context
RetailCorp, a leading e-commerce platform, handles approximately 10TB of sales data daily from multiple sources, including transactional databases, web logs, and third-party APIs. The current ETL process is batch-oriented, running nightly, which leads to data freshness issues and delayed insights for the business analytics team. To address this, the VP of Data Engineering has mandated the design of a new ETL pipeline that ensures data quality and provides near real-time analytics capabilities.
Current Architecture
| Component | Technology | Issue |
|---|---|---|
| Data Sources | MySQL, REST APIs, and S3 | Daily batch load causes data latency |
| ETL Tool | Apache Nifi | Limited data quality checks |
| Storage | Amazon Redshift | Slow query performance due to unoptimized table design |
| Orchestration | Apache Airflow | Complex and difficult to manage |
Scale Requirements
- Throughput: Process 10TB of data daily, an average of roughly 420GB/hour.
- Latency: Ensure data is available for querying within 1 hour of ingestion.
- Retention: Store raw data for 30 days and aggregated data indefinitely (see the lifecycle-policy sketch after this list).
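One way to meet the retention requirement is an S3 lifecycle rule that expires raw objects after 30 days while leaving aggregated data untouched. The sketch below assumes a `retailcorp-raw-data` bucket and a `raw/` prefix, neither of which is specified in the brief.

```python
import boto3

# Assumed bucket and prefix; adjust to the actual raw-data landing zone.
RAW_BUCKET = "retailcorp-raw-data"

s3 = boto3.client("s3")

# Expire objects under raw/ after 30 days; aggregated data lives elsewhere
# (e.g. in Redshift or a separate prefix) and is not touched by this rule.
s3.put_bucket_lifecycle_configuration(
    Bucket=RAW_BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-raw-after-30-days",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```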
Requirements
- Design an ETL pipeline that ingests data from multiple sources, with data integrity and quality checks applied at each stage.
- Implement transformations to optimize data for analytics, including deduplication, validation, and schema enforcement (a transformation sketch follows this list).
- Load processed data into Amazon Redshift using table structures optimized for query performance, such as distribution and sort keys (a DDL sketch follows below).
- Create monitoring and alerting mechanisms for data quality issues, latency, and system health.
- Orchestrate the pipeline with Apache Airflow, with clear task dependencies and error handling (a DAG skeleton is sketched below).
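To make the deduplication, validation, and schema-enforcement requirement concrete, here is a minimal pandas sketch; the column names and business rules are illustrative assumptions, not part of the brief.

```python
import pandas as pd

# Assumed schema for a sales record; the real schema would come from the
# source systems and be enforced centrally (e.g. via a schema registry).
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64",
                    "amount": "float64", "order_ts": "datetime64[ns]"}

def clean_sales_batch(df: pd.DataFrame) -> pd.DataFrame:
    """Deduplicate, validate, and enforce the expected schema on one batch."""
    # Schema enforcement: fail fast if columns are missing, then cast types.
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    df = df.astype({k: v for k, v in EXPECTED_COLUMNS.items()
                    if v != "datetime64[ns]"})
    df["order_ts"] = pd.to_datetime(df["order_ts"])

    # Validation: drop rows that violate basic business rules.
    df = df[df["amount"] >= 0]

    # Deduplication: keep the latest record per order_id.
    df = (df.sort_values("order_ts")
            .drop_duplicates(subset="order_id", keep="last"))
    return df
```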
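For the Redshift load, optimizing table structures typically means choosing distribution and sort keys that match join and filter patterns. A minimal sketch using psycopg2 follows; the table, columns, and connection details are assumptions.

```python
import os
import psycopg2

# Connection details are placeholders; in practice they would come from
# Secrets Manager or an Airflow connection.
conn = psycopg2.connect(
    host="retailcorp.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password=os.environ["REDSHIFT_PASSWORD"],
)

# Assumed sales fact table: distribute on customer_id to co-locate joins with
# a customer dimension, sort on order_ts so time-range scans prune blocks.
DDL = """
CREATE TABLE IF NOT EXISTS fact_sales (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12, 2),
    order_ts    TIMESTAMP
)
DISTKEY (customer_id)
SORTKEY (order_ts);
"""

with conn, conn.cursor() as cur:
    cur.execute(DDL)
```

The actual distribution and sort keys should be chosen from observed query patterns rather than copied from this sketch.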
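A minimal Airflow DAG skeleton for the ingest → quality-check → load flow is sketched below. The task names, hourly schedule, and failure callback are illustrative assumptions; the callback stands in for the alerting mechanism named above, and the placeholder functions would wrap the source extraction, the quality checks, and the Redshift COPY.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def notify_on_failure(context):
    # Placeholder alert hook; in practice this would page the on-call
    # channel (Slack, SNS, PagerDuty) with the failed task and run details.
    print(f"Task failed: {context['task_instance'].task_id}")

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,
}

def ingest_from_sources(**_):
    # Pull incremental data from MySQL, REST APIs, and S3 into a raw/ prefix.
    ...

def run_quality_checks(**_):
    # Deduplicate, validate, and enforce schema (see clean_sales_batch above);
    # raise an exception to fail the task if quality thresholds are not met.
    ...

def load_to_redshift(**_):
    # COPY curated files from S3 into the optimized Redshift tables.
    ...

with DAG(
    dag_id="retailcorp_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # hourly micro-batches to meet the 1-hour latency target
    catchup=False,
    default_args=default_args,
) as dag:
    ingest = PythonOperator(task_id="ingest_from_sources",
                            python_callable=ingest_from_sources)
    quality = PythonOperator(task_id="run_quality_checks",
                             python_callable=run_quality_checks)
    load = PythonOperator(task_id="load_to_redshift",
                          python_callable=load_to_redshift)

    # Clear dependencies: ingest, then quality checks, then the warehouse load.
    ingest >> quality >> load
```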
Constraints
- Team: 5 data engineers with experience in Python and AWS.
- Infrastructure: AWS-based environment (Redshift, S3, Lambda).
- Budget: $15K/month for cloud services.