Context
InsightHub, a B2B analytics company, receives daily customer data exports from SaaS tools, internal PostgreSQL databases, and third-party CSV uploads. Analysts currently clean files manually in notebooks, causing inconsistent logic, duplicate records, and delayed reporting.
You need to design a production-grade batch pipeline that ingests raw datasets, standardizes and validates them, and publishes clean tables for downstream analysis. The goal is to make data preparation repeatable, auditable, and easy to backfill.
Scale Requirements
- Sources: 25 daily CSV/JSON feeds + 3 PostgreSQL source tables
- Volume: 300 GB raw data/day, growing 8% month-over-month (a rough sizing sketch follows this list)
- Row count: ~1.2 billion records/day across all sources
- Latency target: Raw file arrival to analytics-ready tables in < 2 hours
- Retention: 180 days raw, 3 years curated
- SLA: 99.5% successful daily pipeline completion by 6:00 AM UTC
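The growth and retention figures above imply a sizeable raw footprint within a year. A back-of-envelope sizing, assuming the 300 GB/day starting point and 8% month-over-month compounding stated above; the 12-month horizon and variable names are illustrative only:

```python
# Back-of-envelope sizing from the figures above: 300 GB/day raw,
# 8% month-over-month growth, 180-day raw retention.
# The 12-month horizon and variable names are illustrative assumptions.
DAILY_GB = 300
MONTHLY_GROWTH = 0.08
MONTHS = 12

daily = float(DAILY_GB)
for month in range(1, MONTHS + 1):
    daily *= 1 + MONTHLY_GROWTH
    print(f"Month {month:2d}: ~{daily:,.0f} GB/day raw")

# Rough upper bound on the raw landing zone at month 12:
# 180 retained days priced at the month-12 daily rate.
print(f"Raw retention upper bound: ~{daily * 180 / 1024:,.1f} TB")
```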
Requirements
- Ingest raw files and database extracts into a landing zone without modifying source data.
- Detect schema issues, malformed rows, null spikes, duplicates, and invalid business values (see the validation sketch after this list).
- Standardize types, timestamps, categorical values, and column names across datasets (a combined standardization and deduplication sketch follows this list).
- Implement deterministic deduplication and basic enrichment (reference table joins, derived columns).
- Publish cleaned, analytics-ready tables to a warehouse for BI and ad hoc SQL analysis.
- Support idempotent reruns, partition-based backfills, and auditability of every transformation (see the partition-overwrite DAG sketch after this list).
- Define monitoring, alerting, and failure recovery for late files, bad schemas, and partial loads (an alerting sketch follows this list).
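For the validation requirement, a minimal sketch of batch-level checks for schema drift, null spikes, duplicate keys, and invalid business values. It uses pandas purely for readability; at ~1.2 billion rows/day these rules would run in Spark or as warehouse SQL, and every column name, threshold, and rule below is an assumption rather than part of the brief:

```python
import pandas as pd

# Illustrative expected schema and rules; real values would come from a config store.
EXPECTED_COLUMNS = {"order_id": "int64", "customer_id": "int64",
                    "order_ts": "datetime64[ns, UTC]", "amount_usd": "float64"}
NULL_SPIKE_THRESHOLD = 0.05          # assumed: flag if >5% nulls in a column
VALID_AMOUNT_RANGE = (0, 1_000_000)  # assumed business rule

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable issues found in one daily batch."""
    issues = []

    # 1. Schema drift: missing or unexpected columns.
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    extra = set(df.columns) - set(EXPECTED_COLUMNS)
    if missing:
        issues.append(f"missing columns: {sorted(missing)}")
    if extra:
        issues.append(f"unexpected columns: {sorted(extra)}")

    # 2. Null spikes per expected column.
    for col in EXPECTED_COLUMNS:
        if col in df.columns:
            null_rate = df[col].isna().mean()
            if null_rate > NULL_SPIKE_THRESHOLD:
                issues.append(f"{col}: null rate {null_rate:.1%} exceeds threshold")

    # 3. Duplicate business keys.
    if "order_id" in df.columns:
        dup_count = int(df["order_id"].duplicated().sum())
        if dup_count:
            issues.append(f"{dup_count} duplicate order_id values")

    # 4. Invalid business values.
    if "amount_usd" in df.columns:
        lo, hi = VALID_AMOUNT_RANGE
        bad = int((~df["amount_usd"].between(lo, hi)).sum())
        if bad:
            issues.append(f"{bad} rows with amount_usd outside [{lo}, {hi}]")

    return issues
```

In practice the returned issues could be written to an audit table so a failing batch is quarantined rather than published.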
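A companion sketch for standardization and deterministic deduplication, again in pandas for illustration; the snake_case rule, the example columns, and the "latest extract wins" tie-breaker are assumptions:

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Normalize a column name like 'OrderID' or 'Order ID' to 'order_id'."""
    name = re.sub(r"[\s\-]+", "_", name.strip())
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.rename(columns={c: to_snake_case(c) for c in df.columns})
    # Parse timestamps to UTC and coerce numeric types; bad values become
    # NaT/NaN and are caught by the validation step rather than silently dropped.
    df["order_ts"] = pd.to_datetime(df["order_ts"], utc=True, errors="coerce")
    df["amount_usd"] = pd.to_numeric(df["amount_usd"], errors="coerce")
    # Normalize categorical values (assumed mapping).
    df["status"] = df["status"].str.strip().str.lower().replace({"cancelled": "canceled"})
    return df

def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    # Deterministic rule: for each business key keep the row from the latest
    # extract; a stable sort means reruns always pick the same row.
    return (df.sort_values(["order_id", "extracted_at"], kind="mergesort")
              .drop_duplicates(subset=["order_id"], keep="last"))
```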
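For idempotent reruns and backfills on the existing Airflow + Snowflake stack, one common pattern is to scope every load to the logical date and overwrite that partition (delete, then reload). A minimal DAG sketch follows; the table names, connection id, and the SnowflakeOperator import path are assumptions and depend on the installed provider version:

```python
from datetime import datetime

from airflow import DAG
# Import path assumed; newer Snowflake provider versions expose
# SQLExecuteQueryOperator instead of SnowflakeOperator.
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# Idempotent partition overwrite: each run deletes and reloads only its own
# logical date, so reruns and backfills converge to the same result.
# Table, schema, and connection names are illustrative assumptions.
OVERWRITE_PARTITION_SQL = [
    "DELETE FROM analytics.orders_clean WHERE load_date = '{{ ds }}'",
    """
    INSERT INTO analytics.orders_clean
    SELECT * FROM staging.orders_validated WHERE load_date = '{{ ds }}'
    """,
]

with DAG(
    dag_id="orders_daily_clean",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",   # schedule_interval on older Airflow versions
    catchup=True,           # enables partition-based backfills out of the box
    max_active_runs=1,
) as dag:
    publish = SnowflakeOperator(
        task_id="overwrite_partition",
        snowflake_conn_id="snowflake_default",
        sql=OVERWRITE_PARTITION_SQL,
    )
```

Because each run rewrites only its own load_date partition, backfilling a date range with `airflow dags backfill` reproduces history without double-counting.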
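For monitoring and failure recovery, Airflow's built-in retry, failure-callback, and SLA hooks cover late files and partial loads without new infrastructure. A sketch of the default_args such a DAG could use; the alert helper is an assumed stand-in for whatever channel the team already uses:

```python
from datetime import timedelta

def notify_on_failure(context):
    """Airflow on_failure_callback: receives the task context when a task fails."""
    ti = context["task_instance"]
    send_alert(
        f"Pipeline failure: {ti.dag_id}.{ti.task_id} "
        f"for {context['ds']} (try {ti.try_number})"
    )

def send_alert(message: str) -> None:
    # Assumed stand-in for the team's alerting channel (Slack, PagerDuty, email).
    print(message)

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "on_failure_callback": notify_on_failure,
    # Per-task SLA: alerts if a task has not finished within the window,
    # catching late files and stalled loads before the 6:00 UTC deadline.
    "sla": timedelta(hours=2),
}
```

These defaults would be passed as default_args to the DAG sketched above; schema failures surfaced by the validation step can fail their task explicitly so the same callback fires.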
Constraints
- Existing stack is AWS-centric and already uses S3, Airflow, and Snowflake.
- Team size is 3 data engineers; operational complexity should stay low.
- Budget for incremental infrastructure is capped at $15K/month.
- Some datasets contain PII and must support column-level masking and deletion requests within 7 days (a masking-policy sketch follows).
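The PII constraint maps naturally onto Snowflake column-level masking policies attached to the curated tables, plus a scheduled job that drains deletion requests. The sketch below issues the DDL and deletes through the Python connector; the policy, role, table, and column names are assumptions:

```python
import snowflake.connector  # assumed: the team's existing Snowflake Python connector

# Illustrative masking policy: analysts see a redacted value, a dedicated
# PII role sees the raw column. Policy, role, table, and column names are assumptions.
MASK_EMAIL_SQL = """
CREATE MASKING POLICY IF NOT EXISTS analytics.governance.mask_email
  AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END
"""

APPLY_POLICY_SQL = """
ALTER TABLE analytics.orders_clean
  MODIFY COLUMN customer_email
  SET MASKING POLICY analytics.governance.mask_email
"""

DELETE_CUSTOMER_SQL = "DELETE FROM analytics.orders_clean WHERE customer_id = %s"

def apply_pii_masking(conn) -> None:
    cur = conn.cursor()
    try:
        cur.execute(MASK_EMAIL_SQL)
        cur.execute(APPLY_POLICY_SQL)
    finally:
        cur.close()

def process_deletion_request(conn, customer_id: int) -> None:
    # Deletion requests must complete within 7 days; a scheduled job could
    # drain a request queue by running this per curated table.
    cur = conn.cursor()
    try:
        cur.execute(DELETE_CUSTOMER_SQL, (customer_id,))
    finally:
        cur.close()
```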