Context
Northstar Retail runs a nightly batch pipeline that loads order, payment, and shipment data from PostgreSQL and S3 into Snowflake for finance and operations reporting. Over the last month, the pipeline has failed intermittently, triggered by a schema change and occasional upstream delays, and the daily revenue dashboards have repeatedly missed their 8:00 AM UTC deadline.
The current stack uses Apache Airflow 2.x for orchestration, AWS Glue PySpark jobs for transformations, Amazon S3 as the raw landing zone, and Snowflake as the warehouse. You need to design how you would troubleshoot failures quickly, prevent bad data from reaching downstream tables, and make the pipeline recoverable without manual cleanup.
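A minimal Airflow 2.x DAG skeleton for this kind of nightly batch is sketched below; the DAG id, task names, schedule, and failure callback are illustrative assumptions, not the existing pipeline. Retries plus an on-failure callback are one way to meet the 5-minute failure-detection target, and the DAG-run timeout guards the batch window.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_on_failure(context):
    # Placeholder alerting hook (e.g. Slack/PagerDuty); context carries the
    # failed task id and run details.
    print(f"Task failed: {context['task_instance'].task_id}")


default_args = {
    "owner": "data-eng",
    "retries": 2,                              # retry transient upstream hiccups
    "retry_delay": timedelta(minutes=5),
    "on_failure_callback": notify_on_failure,  # fires once retries are exhausted
}

with DAG(
    dag_id="nightly_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 1 * * *",                  # batch window opens 1:00 AM UTC
    catchup=False,
    default_args=default_args,
    dagrun_timeout=timedelta(hours=5, minutes=30),  # abort if the window is blown
) as dag:

    def extract_orders(**_):
        ...  # incremental pull from PostgreSQL, keyed on an updated_at watermark

    def transform_orders(**_):
        ...  # run the Glue PySpark job for the business date being processed

    def load_snowflake(**_):
        ...  # load the curated Snowflake tables for the same business date

    extract = PythonOperator(task_id="extract_orders", python_callable=extract_orders)
    transform = PythonOperator(task_id="transform_orders", python_callable=transform_orders)
    load = PythonOperator(task_id="load_snowflake", python_callable=load_snowflake)

    extract >> transform >> load
```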
Scale Requirements
- Sources: PostgreSQL OLTP (~150 tables), S3 partner files (~2,000 files/day)
- Daily volume: 450M order-related records/day, ~1.8 TB raw compressed data
- SLA: Curated finance tables available by 8:00 AM UTC
- Batch window: 1:00 AM-6:30 AM UTC
- Latency target: Failed task detection within 5 minutes; full recovery within 60 minutes
- Retention: Raw data 180 days, curated tables 3 years
Requirements
- Design a batch pipeline that ingests incremental order data and partner files into Snowflake.
- Explain how you would detect and troubleshoot failures caused by schema drift, late-arriving files, duplicate loads, and task dependency issues (a schema drift check is sketched after this list).
- Ensure transformations are idempotent so reruns do not double-count revenue (see the partition-scoped load sketch after this list).
- Add data quality checks for row-count anomalies, null primary keys, duplicate order IDs, and reconciliation against source totals (see the quality-check sketch after this list).
- Define alerting, retry behavior, backfill strategy, and dead-letter handling for bad files (see the quarantine sketch after this list).
- Support partition-level reruns for a single business date without reprocessing the full pipeline (also covered by the partition-scoped load sketch).
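Illustrative sketches for several of these requirements follow; all table, column, bucket, and connection names are assumptions made for the example, not the actual Northstar schema.

A simple schema drift check compares the columns of a newly staged extract against an expected contract before loading, so a changed source schema fails fast instead of loading NULLs. The cursor is assumed to be a snowflake-connector-python cursor, and FINANCE.STG_ORDERS is a hypothetical staging table.

```python
# Minimal schema drift check against Snowflake's information_schema.
# The expected column set and the FINANCE.STG_ORDERS table are illustrative.
EXPECTED_ORDER_COLUMNS = {
    "order_id", "customer_id", "order_ts", "amount", "currency", "business_date",
}


def check_schema_drift(cur) -> None:
    cur.execute(
        "SELECT LOWER(column_name) FROM information_schema.columns "
        "WHERE table_schema = 'FINANCE' AND table_name = 'STG_ORDERS'"
    )
    actual = {row[0] for row in cur.fetchall()}

    missing = EXPECTED_ORDER_COLUMNS - actual
    extra = actual - EXPECTED_ORDER_COLUMNS
    if missing:
        # Fail fast: downstream selects would break or silently load NULLs.
        raise ValueError(f"Schema drift, expected columns missing: {sorted(missing)}")
    if extra:
        # New columns are non-breaking when loads select columns explicitly;
        # surface them for review instead of failing the run.
        print(f"Schema drift, new columns detected: {sorted(extra)}")
```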
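Idempotency and partition-level reruns can both be handled by scoping every write to a single business date and replacing that slice atomically, for example a delete-then-insert (or MERGE) inside one transaction; rerunning the same date then never double-counts revenue. The sketch below assumes the snowflake-connector-python client and hypothetical finance.stg_orders and finance.curated_orders tables.

```python
import snowflake.connector  # assumed client; conn = snowflake.connector.connect(...)


def reload_business_date(conn, business_date: str) -> None:
    # Replace exactly one business date's slice of the curated table, so a
    # rerun or backfill of that date is safe to repeat.
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute(
            "DELETE FROM finance.curated_orders WHERE business_date = %s",
            (business_date,),
        )
        cur.execute(
            """
            INSERT INTO finance.curated_orders
            SELECT * FROM finance.stg_orders WHERE business_date = %s
            """,
            (business_date,),
        )
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    finally:
        cur.close()
```

The same function serves partition-level reruns and SOX-reproducible backfills: re-triggering the DAG for one logical date (e.g. `airflow dags backfill -s 2025-01-02 -e 2025-01-02 nightly_orders_pipeline`) rewrites only that date's slice.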
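Data quality gates can run after staging and before the curated publish; a failed check raises, which stops the Airflow task so nothing bad reaches the finance tables. The specific thresholds, the 1% reconciliation tolerance, and the table names below are illustrative assumptions.

```python
def run_quality_checks(cur, business_date: str, source_row_count: int) -> None:
    # Hard checks: any offending row fails the run.
    checks = {
        "null_primary_keys": (
            "SELECT COUNT(*) FROM finance.stg_orders "
            "WHERE business_date = %s AND order_id IS NULL"
        ),
        "duplicate_order_ids": (
            "SELECT COUNT(*) FROM ("
            "  SELECT order_id FROM finance.stg_orders "
            "  WHERE business_date = %s GROUP BY order_id HAVING COUNT(*) > 1) dup"
        ),
    }
    failures = []
    for name, sql in checks.items():
        cur.execute(sql, (business_date,))
        offending = cur.fetchone()[0]
        if offending > 0:
            failures.append(f"{name}: {offending} offending rows")

    # Reconciliation / row-count anomaly: staged rows should match the source
    # extract within a small tolerance (1% assumed here).
    cur.execute(
        "SELECT COUNT(*) FROM finance.stg_orders WHERE business_date = %s",
        (business_date,),
    )
    staged = cur.fetchone()[0]
    if source_row_count and abs(staged - source_row_count) / source_row_count > 0.01:
        failures.append(f"row_count_mismatch: staged={staged}, source={source_row_count}")

    if failures:
        # Raising aborts the Airflow task before anything is published downstream.
        raise ValueError("Data quality checks failed: " + "; ".join(failures))
```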
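For bad partner files, one dead-letter pattern is to quarantine the object under a dedicated prefix with the failure reason attached, alert, and let the rest of the batch continue; quarantined files can be repaired and replayed later. The bucket layout and prefix below are assumptions.

```python
import boto3


def quarantine_bad_file(bucket: str, key: str, reason: str) -> None:
    # Copy the bad file to a dead-letter prefix with the failure reason as
    # object metadata, then remove it from the landing zone.
    s3 = boto3.client("s3")
    dead_letter_key = f"dead-letter/{key}"
    s3.copy_object(
        Bucket=bucket,
        Key=dead_letter_key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata={"quarantine-reason": reason},
        MetadataDirective="REPLACE",  # replace metadata so the reason is stored
    )
    s3.delete_object(Bucket=bucket, Key=key)
```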
Constraints
- AWS-first environment; no new streaming platform may be introduced
- Incremental budget cap: $12K/month
- Small team: 3 data engineers, shared DevOps support
- SOX compliance: finance tables require auditability and reproducible reruns