Predict Pipeline Failures Before Impact

Scenario

You are responsible for a business-critical data platform that feeds supply chain, finance, and operational dashboards. Over the last quarter, several upstream connector slowdowns and transformation regressions caused missed SLAs, but the failures were only detected after downstream tables were already stale. Leadership now wants a pipeline that uses operational telemetry to predict likely job or system failures 30-60 minutes before they happen, so the team can reroute, scale, or pause dependent workloads before business users are affected.

Current State

Component	Status / Technology
Source ingestion	SAP, SFTP, REST APIs, and factory telemetry via Kafka
Orchestration	Apache Airflow 2.x with DAG dependencies and retries
Processing	Spark on Databricks for batch and Structured Streaming
Storage	ADLS Gen2 bronze/silver/gold lakehouse + Snowflake marts
Transformations	dbt incremental models
Monitoring	Azure Monitor, Datadog, Airflow alerts, Great Expectations

Scale: ~18K Airflow task runs/day, 7 TB/day ingested, 120 streaming topics, 1.5M events/min peak, 15-minute freshness SLA for priority datasets, 99.5% monthly pipeline success target.
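The scale figures above imply a concrete failure budget, which is worth working out before designing any predictor. The sketch below is only back-of-the-envelope arithmetic. It assumes the 99.5% target is measured per Airflow task run and that a month has 30 days; neither assumption is stated in the scenario.

```python
# Figures from the Scale line. DAYS_PER_MONTH and the per-task-run
# interpretation of the success target are assumptions, not given facts.
TASK_RUNS_PER_DAY = 18_000
SUCCESS_TARGET = 0.995
DAYS_PER_MONTH = 30
PEAK_EVENTS_PER_MIN = 1_500_000
FRESHNESS_SLA_MIN = 15


def monthly_failure_budget(runs_per_day: int, success_target: float, days: int) -> int:
    """Task runs that may fail in a month while still meeting the target."""
    total_runs = runs_per_day * days
    return round(total_runs * (1 - success_target))


def events_within_sla_window(events_per_min: int, sla_minutes: int) -> int:
    """Events that arrive during one freshness window at peak rate."""
    return events_per_min * sla_minutes


budget = monthly_failure_budget(TASK_RUNS_PER_DAY, SUCCESS_TARGET, DAYS_PER_MONTH)
daily_budget = budget // DAYS_PER_MONTH
backlog = events_within_sla_window(PEAK_EVENTS_PER_MIN, FRESHNESS_SLA_MIN)

print(f"failed runs allowed per month: {budget}")
print(f"failed runs allowed per day:   {daily_budget}")
print(f"events per 15-min window at peak: {backlog}")
```

Under these assumptions the budget is about 2,700 failed task runs per month (roughly 90 per day), and a streaming stall at peak lets about 22.5 million events accumulate within one 15-minute freshness window.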

Question

How would you design a predictive reliability pipeline that detects leading indicators of pipeline or infrastructure failure before downstream SLAs are missed, and how would you integrate it into orchestration, remediation, and data quality controls without creating excessive false positives?
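The false-positive concern in the question can be made concrete with base-rate arithmetic. The sketch below applies Bayes' rule to estimate what share of alerts would precede a real failure. The recall and false-positive rate are hypothetical placeholders, and the base rate assumes failures occur at exactly the rate the 99.5% target allows, counted per task run.

```python
def alert_precision(base_rate: float, recall: float, false_positive_rate: float) -> float:
    """Fraction of alerts that precede a real failure (Bayes' rule)."""
    true_alerts = recall * base_rate
    false_alerts = false_positive_rate * (1 - base_rate)
    return true_alerts / (true_alerts + false_alerts)


def alerts_per_day(runs_per_day: int, base_rate: float, recall: float,
                   false_positive_rate: float) -> float:
    """Expected number of alerts raised per day, true and false combined."""
    alert_rate = recall * base_rate + false_positive_rate * (1 - base_rate)
    return runs_per_day * alert_rate


RUNS_PER_DAY = 18_000   # from the scenario
BASE_RATE = 0.005       # assumption: failure rate equals 1 - 99.5% target
RECALL = 0.80           # hypothetical predictor recall
FPR = 0.02              # hypothetical false-positive rate per task run

precision = alert_precision(BASE_RATE, RECALL, FPR)
daily_alerts = alerts_per_day(RUNS_PER_DAY, BASE_RATE, RECALL, FPR)

print(f"precision: {precision:.1%}, alerts/day: {daily_alerts:.0f}")
```

With these illustrative inputs, only about 17% of roughly 430 daily alerts would correspond to real failures. That suggests why a design might score at the DAG or dataset level, require several corroborating signals, or reserve automated remediation for high-confidence predictions rather than alerting on every task run.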
