Context
FinEdge, a mid-size fintech company, runs daily ETL pipelines to produce finance and risk reporting in Snowflake. The current Airflow setup has grown organically: upstream extracts, staging loads, dimension builds, and downstream marts are wired together with ad hoc task dependencies, causing missed SLAs, duplicate runs, and confusion whenever upstream loads are delayed.
You are asked to redesign how workflow dependencies are modeled and coordinated so that pipelines run reliably, recover cleanly, and make upstream/downstream readiness explicit.
Scale Requirements
- Sources: 18 upstream systems (PostgreSQL, Salesforce, Stripe, S3 file drops)
- Pipelines: 45 daily DAGs, ~600 tasks total
- Volume: 2.5 TB/day raw data, 8B rows/day processed
- SLA: Executive dashboards ready by 7:00 AM UTC
- Latency: Critical finance marts available within 30 minutes of all upstream data arriving
- Backfill window: Reprocess up to 180 days of history without corrupting downstream tables
Requirements
- Design a dependency strategy for extract, stage, transform, and publish layers across multiple DAGs.
- Explain how you would distinguish data dependencies from task-completion dependencies.
- Ensure downstream jobs do not run on partial or late upstream data.
- Support idempotent reruns, partition-level backfills, and safe recovery after failures.
- Define how readiness signals, dataset versioning, or partition markers should be stored and checked.
- Include monitoring for blocked dependencies, SLA misses, and duplicate processing.
- Show how orchestration should handle both scheduled runs and event-driven triggers from file arrivals.
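One way a candidate might approach the readiness-signal requirement above is with partition markers: each upstream extract publishes a `_SUCCESS`-style marker per source and partition date, and downstream jobs gate on the full set of markers rather than on task completion. The sketch below is a minimal, illustrative version; the function names and the in-memory `marker_store` (standing in for an S3 prefix or DynamoDB table) are hypothetical, not part of the brief.

```python
import json
from datetime import date

# Hypothetical marker store; in practice this would be an S3 prefix or a
# DynamoDB table keyed by (source, partition). A dict stands in here.
marker_store: dict[str, str] = {}

def publish_marker(source: str, partition: date, row_count: int) -> None:
    """Upstream extract writes a _SUCCESS-style marker once its partition is complete."""
    key = f"{source}/{partition.isoformat()}/_SUCCESS"
    marker_store[key] = json.dumps(
        {"rows": row_count, "partition": partition.isoformat()}
    )

def partition_ready(sources: list[str], partition: date) -> bool:
    """Downstream job checks that EVERY upstream source has published its marker
    for this partition before starting — a data dependency, not a task dependency."""
    return all(
        f"{s}/{partition.isoformat()}/_SUCCESS" in marker_store for s in sources
    )

# Example: a finance mart waits on three upstream extracts for one partition date.
d = date(2024, 6, 1)
publish_marker("postgres_orders", d, 1_200_000)
publish_marker("stripe_payments", d, 450_000)
print(partition_ready(["postgres_orders", "stripe_payments", "salesforce_accounts"], d))  # False
publish_marker("salesforce_accounts", d, 80_000)
print(partition_ready(["postgres_orders", "stripe_payments", "salesforce_accounts"], d))  # True
```

Because the marker carries a row count, the same check can also flag partial loads (marker present but rows below an expected floor) rather than only missing ones.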
Constraints
- Existing stack is AWS-based and must remain in place.
- Team size is 3 data engineers; operational complexity should stay low.
- Finance data is SOX-auditable, so lineage and rerun history must be preserved.
- Budget allows managed services already in use, but not a major platform rewrite.
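The idempotent-rerun and SOX-audit requirements interact: a rerun must leave the target table in the same state as a first run, while the fact that a rerun happened must be preserved. One common pattern is delete-and-reload by partition plus an append-only run log. The sketch below illustrates that pattern only; the dict-based `fact_table` and `run_log` stand in for a Snowflake table and an append-only audit table, and all names are hypothetical.

```python
from datetime import date, datetime, timezone

# In-memory stand-ins; in practice the fact table lives in Snowflake and the
# run log in an append-only audit table. All names here are hypothetical.
fact_table: dict[tuple, dict] = {}   # keyed by (partition, business key)
run_log: list[dict] = []             # append-only rerun history for audit

def load_partition(partition: date, rows: list[dict], run_id: str) -> None:
    """Idempotent partition load: delete the target partition, reinsert its rows,
    and append an audit record. Rerunning with the same input yields the same
    table state, so 180-day backfills cannot double-count."""
    # 1. Delete only this partition (equivalent to DELETE ... WHERE ds = :partition).
    for key in [k for k in fact_table if k[0] == partition]:
        del fact_table[key]
    # 2. Reinsert the partition's rows.
    for row in rows:
        fact_table[(partition, row["id"])] = row
    # 3. Record the run; the log is appended, never updated in place.
    run_log.append({
        "run_id": run_id,
        "partition": partition.isoformat(),
        "rows_loaded": len(rows),
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    })

d = date(2024, 6, 1)
load_partition(d, [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}], run_id="run-001")
load_partition(d, [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}], run_id="run-002")  # rerun
print(len(fact_table))  # 2 rows, not 4: the rerun replaced the partition
print(len(run_log))     # 2 audit records preserved
```

Scoping the delete to a single partition is what keeps 180-day backfills safe for downstream tables, and the untouched run log is what makes the rerun history reconstructable for an auditor.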