Context
RetailCorp, a major retail chain, collects vast amounts of transactional data from over 1,000 stores nationwide. Currently, data is processed in daily batch jobs on on-premises infrastructure, which delays analytics and reporting. The data engineering team aims to transition to a cloud-based ETL pipeline to improve data freshness and scalability.
Scale Requirements
- Data Volume: 10TB of transaction data weekly (roughly 1.4TB per day on average).
- Latency: ETL jobs must complete within 2 hours of data availability.
- Concurrency: Support 50 simultaneous ETL jobs during peak hours.
- Retention: Store raw data for 30 days and aggregated data indefinitely.
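A quick sanity check on the numbers above: 10TB per week landed within a 2-hour window implies a sustained ingest rate of roughly 200 MB/s for the daily batch. A minimal sketch of that arithmetic (decimal TB/GB assumed):

```python
# Back-of-envelope throughput check for the scale requirements above.
WEEKLY_TB = 10            # stated weekly volume
WINDOW_SECONDS = 2 * 3600 # ETL must finish within 2 hours of data availability

daily_bytes = WEEKLY_TB / 7 * 1e12          # ~1.43 TB/day in decimal bytes
required_mb_per_s = daily_bytes / WINDOW_SECONDS / 1e6

print(f"Daily volume: {daily_bytes / 1e12:.2f} TB")
print(f"Required sustained throughput: {required_mb_per_s:.0f} MB/s")
```

This is the floor for a single end-to-end run; with 50 concurrent jobs at peak, per-job throughput can be lower, but the aggregate pipeline still has to clear ~200 MB/s.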
Functional Requirements
- Design an ETL pipeline that extracts data from multiple sources (POS systems, online transactions, and inventory databases).
- Implement data quality checks (schema validation, completeness checks, and anomaly detection).
- Transform and aggregate data for analytics, creating summary tables in Snowflake.
- Schedule and orchestrate ETL jobs using a cloud-native orchestration tool.
- Enable monitoring and alerting for data quality issues and job failures.
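The data quality checks called for above can be prototyped in plain Python before committing to a framework. A minimal sketch, assuming a hypothetical transaction record shape (`txn_id`, `store_id`, `amount`, `ts` are illustrative field names, not from the source), covering schema validation, completeness, and a simple z-score anomaly check:

```python
from statistics import mean, stdev

# Hypothetical schema for a POS transaction record (field name -> expected type).
EXPECTED_SCHEMA = {"txn_id": str, "store_id": int, "amount": float, "ts": str}

def validate_schema(record):
    """Return a list of violations; an empty list means the record conforms."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

def completeness_ratio(records, field):
    """Fraction of records with a non-null value for `field`."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)

def zscore_anomalies(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

In production these checks would typically run as a validation task between extract and load, with failures routed to the monitoring and alerting channel rather than silently dropped.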
Constraints
- Infrastructure: Transition to AWS or GCP for cloud services.
- Budget: Monthly cloud expenditure must not exceed $15K.
- Compliance: Must adhere to PCI DSS for handling payment information.
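The retention and budget constraints can be cross-checked with a rough storage estimate. A minimal sketch, assuming an illustrative object-storage price of $0.023 per GB-month (a stand-in figure, not from the source; actual AWS/GCP pricing varies by region and storage class):

```python
# Rough steady-state storage cost for the 30-day raw retention window.
WEEKLY_TB = 10
RETENTION_DAYS = 30
PRICE_PER_GB_MONTH = 0.023  # illustrative assumption, not a quoted price

daily_tb = WEEKLY_TB / 7                      # ~1.43 TB/day
raw_gb = RETENTION_DAYS * daily_tb * 1000     # ~43,000 GB held at steady state
monthly_cost = raw_gb * PRICE_PER_GB_MONTH

print(f"Raw data at steady state: {raw_gb / 1000:.1f} TB")
print(f"Estimated raw-storage cost: ${monthly_cost:,.0f}/month")
```

Under these assumptions, raw storage consumes only a small slice of the $15K monthly ceiling, leaving most of the budget for compute, orchestration, and Snowflake; the indefinitely retained aggregates and PCI-compliant controls (encryption, access logging) would add to this and should be estimated separately.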