Context
MediCart, a mid-sized e-commerce marketplace, loads order, payment, inventory, and shipment data into Snowflake from PostgreSQL, Stripe webhooks, and CSV files supplied by 40 logistics partners. The current Airflow-based nightly ETL frequently produces incomplete fact tables because upstream files arrive late, webhook events are duplicated, and some source fields are null or missing.
The analytics team needs a redesigned pipeline that can detect missing or incomplete data early, prevent bad loads from reaching reporting tables, and support safe backfills without double-counting revenue or orders.
Scale Requirements
- Batch volume: 120M order-related records/day
- File ingestion: 8,000 partner CSV files/day, 5–500 MB each
- Streaming side input: 15K payment/shipment events/sec at peak
- Latency target: curated tables available by 6:00 AM UTC; critical payment completeness checks within 10 minutes
- Storage: 60 TB historical raw data retained for 1 year
Requirements
- Design an ingestion and transformation pipeline that combines batch files, CDC from PostgreSQL, and event streams into Snowflake.
- Detect missing records, null critical fields, late-arriving files, and partial loads before publishing curated tables (see the completeness-gate sketch after this list).
- Define how to quarantine bad data, replay failed partitions, and backfill missing days safely (see the partition-replay sketch after this list).
- Ensure idempotent processing for duplicate webhook events and repeated file deliveries (see the MERGE sketch after this list).
- Implement data quality checks for row-count reconciliation, freshness, schema validation, and referential integrity.
- Describe monitoring, alerting, and operational ownership for incomplete-data incidents (see the alerting sketch after this list).
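
To illustrate the detection and quality-check requirements (row-count reconciliation, null critical fields, late or missing partner files), one option is a completeness gate that runs as the final task before any curated table is published and fails the run when a check trips. This is a sketch only: the table names (staging.stg_orders, audit.extract_counts, audit.file_registry), the column names, and the 0.001 reconciliation tolerance are assumptions, and it uses the snowflake-connector-python client against the existing Snowflake account.

```python
from datetime import date

import snowflake.connector

# Each check returns a single number; 0 means the check passes. All object
# names and the 0.001 tolerance are assumptions, not an agreed schema.
CHECKS = {
    "null_critical_fields": """
        SELECT COUNT(*)
        FROM staging.stg_orders
        WHERE load_date = %(ds)s
          AND (order_id IS NULL OR order_total IS NULL OR currency IS NULL)
    """,
    "row_count_reconciliation": """
        -- expected_rows is captured per source at extract time; the loaded
        -- count must match within a tolerance of 0.001.
        SELECT COUNT(*)
        FROM (
            SELECT a.source, a.expected_rows, COUNT(s.order_id) AS loaded_rows
            FROM audit.extract_counts a
            LEFT JOIN staging.stg_orders s
              ON s.load_date = a.load_date
             AND s.source    = a.source
            WHERE a.load_date = %(ds)s
            GROUP BY a.source, a.expected_rows
            HAVING ABS(COUNT(s.order_id) - a.expected_rows)
                   > a.expected_rows * 0.001
        ) AS mismatches
    """,
    "partner_files_complete": """
        -- All 40 logistics partners must have at least one loaded file.
        SELECT 40 - COUNT(DISTINCT partner_id)
        FROM audit.file_registry
        WHERE load_date = %(ds)s
          AND status = 'LOADED'
    """,
}


def run_completeness_gate(conn_params: dict, ds: date) -> None:
    """Raise if any check fails, so the downstream publish task never runs."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        failures = []
        for name, sql in CHECKS.items():
            cur.execute(sql, {"ds": ds})
            violations = cur.fetchone()[0]
            if violations > 0:
                failures.append(f"{name}: {violations} violation(s)")
        if failures:
            raise ValueError("completeness gate failed: " + "; ".join(failures))
    finally:
        conn.close()
```

Freshness and schema checks fit the same shape: each is simply another entry in CHECKS whose query returns a violation count.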
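
For quarantine, replay, and backfill, one pattern that keeps reruns reproducible (which also helps the SOX requirement) is to treat each load date as a replaceable partition: bad rows are diverted to a reject table, and a failed or corrected day is deleted and reloaded inside a single transaction so reruns never double-count. The object and column names below are again hypothetical:

```python
from datetime import date

import snowflake.connector

# Hypothetical schema: both tables carry the same order_date partition key, so
# one day can be deleted and reloaded atomically without touching other days.
DELETE_SQL = "DELETE FROM analytics.fct_orders WHERE order_date = %(ds)s"
INSERT_SQL = """
    INSERT INTO analytics.fct_orders (order_id, order_date, customer_id, order_total)
    SELECT order_id, order_date, customer_id, order_total
    FROM staging.stg_orders
    WHERE order_date = %(ds)s
"""


def replay_partition(conn_params: dict, ds: date) -> None:
    """Replace one day's facts in a single transaction; rerunning is harmless."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        cur.execute("BEGIN")
        try:
            cur.execute(DELETE_SQL, {"ds": ds})
            cur.execute(INSERT_SQL, {"ds": ds})
            cur.execute("COMMIT")
        except Exception:
            cur.execute("ROLLBACK")
            raise
    finally:
        conn.close()
```

A backfill of several missing days is then just a loop (or an Airflow backfill run) that calls replay_partition once per date.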
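
Idempotency for duplicate Stripe webhook deliveries and re-sent partner files can be achieved by landing everything append-only in staging and merging into curated tables on a stable key; for payment events the Stripe event id is a natural choice. A minimal sketch with hypothetical table and column names:

```python
import snowflake.connector

# Hypothetical staging/target tables; event_id is the Stripe event identifier,
# which is stable across webhook redeliveries.
MERGE_SQL = """
MERGE INTO analytics.fct_payments AS tgt
USING (
    -- Collapse duplicate deliveries inside the batch first: keep the latest
    -- copy of each event_id.
    SELECT *
    FROM staging.stg_payment_events
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY event_id ORDER BY received_at DESC
    ) = 1
) AS src
ON tgt.event_id = src.event_id
WHEN MATCHED THEN UPDATE SET
    tgt.amount_cents = src.amount_cents,
    tgt.status       = src.status,
    tgt.updated_at   = src.received_at
WHEN NOT MATCHED THEN INSERT
    (event_id, order_id, amount_cents, status, updated_at)
VALUES
    (src.event_id, src.order_id, src.amount_cents, src.status, src.received_at)
"""


def merge_payment_events(conn_params: dict) -> int:
    """Idempotent upsert: re-running the same batch inserts nothing new."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cur = conn.cursor()
        cur.execute(MERGE_SQL)
        return cur.rowcount  # rows inserted or updated in this run
    finally:
        conn.close()
```

Re-delivered partner files can be handled the same way if each staged row carries its source file name (or hash) plus a row key.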
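
For monitoring and ownership, a lightweight starting point is to attach a failure callback and an SLA to the publish DAG so incomplete-data incidents reach the on-call engineer well before the 06:00 UTC target. The sketch assumes Airflow 2.x; the DAG id, owner, and notification hook are placeholders:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_oncall(context):
    """Placeholder alert hook: swap in Slack/PagerDuty for real incidents."""
    ti = context["task_instance"]
    print(f"ALERT: {ti.dag_id}.{ti.task_id} failed for {context['ds']}")


def completeness_gate(**_):
    """Wire this to run_completeness_gate() from the earlier sketch."""


with DAG(
    dag_id="curated_orders_publish",
    start_date=datetime(2024, 1, 1),
    schedule="0 2 * * *",          # nightly, ahead of the 06:00 UTC target
    catchup=False,
    default_args={
        "owner": "data-eng-oncall",
        "retries": 2,
        "retry_delay": timedelta(minutes=10),
        "on_failure_callback": notify_oncall,
        "sla": timedelta(hours=3),  # alert well before the availability target
    },
) as dag:
    PythonOperator(task_id="completeness_gate", python_callable=completeness_gate)
```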
Constraints
- Existing stack is AWS, Snowflake, Airflow, and dbt; avoid introducing more than one major new platform.
- Budget increase is capped at $15K/month.
- SOX compliance requires auditable lineage and reproducible reruns.
- Downstream finance dashboards must not show partially complete daily revenue tables (see the publish-swap sketch after this list).
- Team size: 3 data engineers and 1 analytics engineer.
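
The no-partial-revenue constraint pairs naturally with a write-audit-publish flow: rebuild the day's revenue into a shadow table, run the completeness gate from the earlier sketch, and only then swap the shadow table into the dashboard-facing name (an atomic operation in Snowflake). The table names here are hypothetical:

```python
from datetime import date

import snowflake.connector


def publish_daily_revenue(conn_params: dict, ds: date) -> None:
    """Write-audit-publish: the dashboard-facing table is replaced atomically,
    and only after the gate passes, so finance never sees a half-loaded day."""
    # run_completeness_gate() is the function from the earlier sketch; it
    # raises on any failed check, which stops the swap below from running.
    run_completeness_gate(conn_params, ds)
    conn = snowflake.connector.connect(**conn_params)
    try:
        conn.cursor().execute(
            "ALTER TABLE analytics.fct_daily_revenue "
            "SWAP WITH analytics.fct_daily_revenue_build"
        )
    finally:
        conn.close()
```

Because the swap either happens completely or not at all, a failed run leaves the previous fully loaded day visible to the dashboards rather than a partial one.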