Design an ETL Pipeline with Data Quality Checks

Difficulty: Medium
Category: Pipelines
Asked at 27 companies, including TE Connectivity, Tiger Analytics, City National Bank, Edward Jones, and Benjamin Moore

Problem

Context

RetailCorp, a leading e-commerce platform, handles approximately 10TB of sales data daily from multiple sources, including transactional databases, web logs, and third-party APIs. The current ETL process is batch-oriented, running nightly, which leads to data freshness issues and delayed insights for the business analytics team. To address this, the VP of Data Engineering has mandated the design of a new ETL pipeline that ensures data quality and provides near real-time analytics capabilities.

Current Architecture

Component      Technology                 Issue
Data Sources   MySQL, REST APIs, and S3   Daily batch load causes data latency
ETL Tool       Apache NiFi                Limited data quality checks
Storage        Amazon Redshift            Slow query performance due to unoptimized data structure
Orchestration  Apache Airflow             Complex and difficult to manage

Scale Requirements

  • Throughput: Process 10TB of data daily, averaging roughly 400GB/hour (10TB spread evenly over 24 hours is about 417GB/hour).
  • Latency: Ensure data is available for querying within 1 hour of ingestion.
  • Retention: Store raw data for 30 days and aggregated data indefinitely.
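The scale figures above can be turned into rough sizing numbers. The following is a minimal sketch, assuming decimal units (1 TB = 1000 GB) and perfectly even arrival across the day, which real traffic will not match. The function names are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta

# Figures taken from the scale requirements (decimal units: 1 TB = 1000 GB).
DAILY_VOLUME_TB = 10
RAW_RETENTION_DAYS = 30
FRESHNESS_SLA = timedelta(hours=1)


def average_gb_per_hour(daily_tb: float) -> float:
    """Average hourly volume if data arrived evenly across 24 hours."""
    return daily_tb * 1000 / 24


def average_mb_per_second(daily_tb: float) -> float:
    """Sustained ingest rate implied by the daily volume."""
    return daily_tb * 1_000_000 / (24 * 3600)


def raw_storage_tb(daily_tb: float, retention_days: int) -> float:
    """Uncompressed raw data held at steady state under the retention rule."""
    return daily_tb * retention_days


def meets_freshness_sla(ingested_at: datetime, queryable_at: datetime) -> bool:
    """True if a record became queryable within the SLA after ingestion."""
    return queryable_at - ingested_at <= FRESHNESS_SLA


if __name__ == "__main__":
    print(f"{average_gb_per_hour(DAILY_VOLUME_TB):.0f} GB/hour")
    print(f"{average_mb_per_second(DAILY_VOLUME_TB):.0f} MB/s")
    print(f"{raw_storage_tb(DAILY_VOLUME_TB, RAW_RETENTION_DAYS)} TB of raw data retained")
```

Under these assumptions the stated 400GB/hour is a rounded form of roughly 417 GB/hour, the sustained rate is on the order of 116 MB/s, and 30 days of raw retention implies about 300 TB before compression. Peak-hour traffic would likely sit well above the average, so these are lower bounds for capacity planning.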

Requirements

  1. Design an ETL pipeline that ingests data from multiple sources, ensuring data integrity and quality checks at each stage.
  2. Implement transformations to optimize data for analytics, including deduplication, validation, and schema enforcement.
  3. Load processed data into Amazon Redshift with optimized table structures for performance.
  4. Create monitoring and alerting mechanisms for data quality issues, latency, and system health.
  5. Ensure the pipeline is orchestrated using Apache Airflow with clear dependencies and error handling.
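One way to picture the quality checks named in the requirements above (schema enforcement, validation, deduplication) is a small in-memory pass over a batch of records. This is a sketch under stated assumptions, not a production design: the schema, the field names (`order_id`, `customer_id`, `amount`, `event_time`), and the rule that `order_id` identifies duplicates are all hypothetical, and at 10TB/day the same logic would need to run in a distributed engine instead of a single Python process.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Hypothetical schema for a sales record; field names are illustrative only.
SALES_SCHEMA: Dict[str, type] = {
    "order_id": str,
    "customer_id": str,
    "amount": float,
    "event_time": str,  # expected to be an ISO 8601 timestamp
}


@dataclass
class QualityReport:
    valid: List[Dict[str, Any]] = field(default_factory=list)
    rejected: List[Tuple[Dict[str, Any], str]] = field(default_factory=list)
    duplicates: int = 0


def check_record(record: Dict[str, Any]) -> Optional[str]:
    """Return a rejection reason, or None if the record passes every check."""
    # Schema enforcement: every field present with the expected type.
    for name, expected_type in SALES_SCHEMA.items():
        if name not in record:
            return f"missing field: {name}"
        if not isinstance(record[name], expected_type):
            return f"wrong type for {name}"
    # Validation: simple business rules on field values.
    if record["amount"] < 0:
        return "negative amount"
    try:
        datetime.fromisoformat(record["event_time"])
    except ValueError:
        return "unparseable event_time"
    return None


def run_quality_checks(records: Iterable[Dict[str, Any]]) -> QualityReport:
    """Split a batch into valid, rejected (with reason), and duplicate records."""
    report = QualityReport()
    seen_order_ids = set()
    for record in records:
        reason = check_record(record)
        if reason is not None:
            report.rejected.append((record, reason))
            continue
        # Deduplication: keep the first record seen for each order_id.
        if record["order_id"] in seen_order_ids:
            report.duplicates += 1
            continue
        seen_order_ids.add(record["order_id"])
        report.valid.append(record)
    return report
```

Keeping rejected records together with a reason, instead of silently dropping them, gives the monitoring requirement something concrete to alert on, for example the rejection rate per batch or a spike in one particular reason.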

Constraints

  • Team: 5 data engineers with experience in Python and AWS.
  • Infrastructure: AWS-based environment (Redshift, S3, Lambda).
  • Budget: $15K/month for cloud services.
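The budget constraint interacts with the 30-day raw retention rule, and a quick estimate shows why. The sketch below is illustrative only: the storage price per GB-month and the compression ratio are assumptions that do not come from the problem statement, so check current AWS pricing and measure real compression before relying on either number.

```python
MONTHLY_BUDGET_USD = 15_000   # from the constraints
DAILY_VOLUME_TB = 10          # from the scale requirements
RAW_RETENTION_DAYS = 30       # from the scale requirements

# ASSUMPTIONS, not from the source: an illustrative object-storage price and
# an illustrative compression ratio. Replace both with measured values.
ASSUMED_PRICE_USD_PER_GB_MONTH = 0.023
ASSUMED_COMPRESSION_RATIO = 4.0


def raw_storage_cost_usd(
    daily_tb: float,
    retention_days: int,
    price_per_gb_month: float,
    compression_ratio: float = 1.0,
) -> float:
    """Monthly cost of holding the retained raw data at steady state."""
    stored_gb = daily_tb * 1000 * retention_days / compression_ratio
    return stored_gb * price_per_gb_month


def share_of_budget(cost_usd: float, budget_usd: float) -> float:
    """Fraction of the monthly budget consumed by a given cost."""
    return cost_usd / budget_usd


if __name__ == "__main__":
    for label, ratio in (("uncompressed", 1.0), ("compressed", ASSUMED_COMPRESSION_RATIO)):
        cost = raw_storage_cost_usd(
            DAILY_VOLUME_TB, RAW_RETENTION_DAYS, ASSUMED_PRICE_USD_PER_GB_MONTH, ratio
        )
        share = share_of_budget(cost, MONTHLY_BUDGET_USD)
        print(f"{label}: ${cost:,.0f}/month ({share:.0%} of budget)")
```

With the assumed price, uncompressed raw retention alone would take close to half of the $15K budget, which suggests that compression, a cheaper storage tier, or both would be worth discussing in an answer.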
