Context
FinEdge, a mid-size fintech company, runs daily ETL pipelines to produce finance and risk reporting in Snowflake. The current Airflow setup has grown organically: upstream extracts, staging loads, dimension builds, and downstream marts are wired together with ad hoc task dependencies, causing missed SLAs, duplicate runs, and confusion whenever upstream loads are delayed.
You are asked to redesign how workflow dependencies are modeled and coordinated so that pipelines run reliably, recover cleanly, and make upstream/downstream readiness explicit.
Scale Requirements
- Sources: 18 upstream systems (PostgreSQL, Salesforce, Stripe, S3 file drops)
- Pipelines: 45 daily DAGs, ~600 tasks total
- Volume: 2.5 TB/day raw data, 8B rows/day processed
- SLA: Executive dashboards ready by 7:00 AM UTC
- Latency: Critical finance marts available within 30 minutes of all upstream data arriving
- Backfill window: Reprocess up to 180 days of history without corrupting downstream tables
Requirements
- Design a dependency strategy for extract, stage, transform, and publish layers across multiple DAGs.
- Explain how you would distinguish data dependencies from task-completion dependencies.
- Ensure downstream jobs do not run on partial or late upstream data.
- Support idempotent reruns, partition-level backfills, and safe recovery after failures.
- Define how readiness signals, dataset versioning, or partition markers should be stored and checked.
- Include monitoring for blocked dependencies, SLA misses, and duplicate processing.
- Show how orchestration should handle both scheduled runs and event-driven triggers from file arrivals.
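One way a candidate might approach the readiness-signal requirement above is with partition markers: each upstream extract publishes a `_SUCCESS`-style marker per source and partition date, and downstream jobs gate on the full set of markers rather than on task completion. The sketch below is a minimal, illustrative version; the function names and the in-memory `marker_store` (standing in for an S3 prefix or DynamoDB table) are hypothetical, not part of the brief.

```python
import json
from datetime import date

# Hypothetical marker store; in practice this would be an S3 prefix or a
# DynamoDB table keyed by (source, partition). A dict stands in here.
marker_store: dict[str, str] = {}

def publish_marker(source: str, partition: date, row_count: int) -> None:
    """Upstream extract writes a _SUCCESS-style marker once its partition is complete."""
    key = f"{source}/{partition.isoformat()}/_SUCCESS"
    marker_store[key] = json.dumps(
        {"rows": row_count, "partition": partition.isoformat()}
    )

def partition_ready(sources: list[str], partition: date) -> bool:
    """Downstream job checks that EVERY upstream source has published its marker
    for this partition before starting — a data dependency, not a task dependency."""
    return all(
        f"{s}/{partition.isoformat()}/_SUCCESS" in marker_store for s in sources
    )

# Example: a finance mart waits on three upstream extracts for one partition date.
d = date(2024, 6, 1)
publish_marker("postgres_orders", d, 1_200_000)
publish_marker("stripe_payments", d, 450_000)
print(partition_ready(["postgres_orders", "stripe_payments", "salesforce_accounts"], d))  # False
publish_marker("salesforce_accounts", d, 80_000)
print(partition_ready(["postgres_orders", "stripe_payments", "salesforce_accounts"], d))  # True
```

Because the marker carries a row count, the same check can also flag partial loads (marker present but rows below an expected floor) rather than only missing ones.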
Constraints
- Existing stack is AWS-based and must remain in place.
- Team size is 3 data engineers; operational complexity should stay low.
- Finance data is SOX-auditable, so lineage and rerun history must be preserved.
- Budget allows managed services already in use, but not a major platform rewrite.
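The idempotent-rerun and SOX-audit requirements interact: a rerun must leave the target table in the same state as a first run, while the fact that a rerun happened must be preserved. One common pattern is delete-and-reload by partition plus an append-only run log. The sketch below illustrates that pattern only; the dict-based `fact_table` and `run_log` stand in for a Snowflake table and an append-only audit table, and all names are hypothetical.

```python
from datetime import date, datetime, timezone

# In-memory stand-ins; in practice the fact table lives in Snowflake and the
# run log in an append-only audit table. All names here are hypothetical.
fact_table: dict[tuple, dict] = {}   # keyed by (partition, business key)
run_log: list[dict] = []             # append-only rerun history for audit

def load_partition(partition: date, rows: list[dict], run_id: str) -> None:
    """Idempotent partition load: delete the target partition, reinsert its rows,
    and append an audit record. Rerunning with the same input yields the same
    table state, so 180-day backfills cannot double-count."""
    # 1. Delete only this partition (equivalent to DELETE ... WHERE ds = :partition).
    for key in [k for k in fact_table if k[0] == partition]:
        del fact_table[key]
    # 2. Reinsert the partition's rows.
    for row in rows:
        fact_table[(partition, row["id"])] = row
    # 3. Record the run; the log is appended, never updated in place.
    run_log.append({
        "run_id": run_id,
        "partition": partition.isoformat(),
        "rows_loaded": len(rows),
        "loaded_at": datetime.now(timezone.utc).isoformat(),
    })

d = date(2024, 6, 1)
load_partition(d, [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}], run_id="run-001")
load_partition(d, [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}], run_id="run-002")  # rerun
print(len(fact_table))  # 2 rows, not 4: the rerun replaced the partition
print(len(run_log))     # 2 audit records preserved
```

Scoping the delete to a single partition is what keeps 180-day backfills safe for downstream tables, and the untouched run log is what makes the rerun history reconstructable for an auditor.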