Context
DataForge, a B2B analytics company, runs 8,000 daily batch and streaming tasks across Apache Airflow, dbt, and custom Spark jobs. Today, task dependencies are split across DAG code, SQL models, and ad hoc metadata tables, making it difficult to understand upstream/downstream impact, safely backfill data, or recover from failures without manual coordination.
You need to design a centralized dependency tracking system that records task relationships, execution state, lineage, and readiness for scheduling decisions. The system should support both static dependencies (declared in code) and dynamic dependencies discovered at runtime from dataset partitions and event triggers.
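To make the "static vs. dynamic dependency" distinction concrete, here is a minimal sketch of the core metadata entities. All names (`TaskDef`, `DependencyEdge`, `EdgeKind`, the field layout) are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class EdgeKind(Enum):
    STATIC = "static"    # declared in DAG/dbt/Spark code at definition time
    DYNAMIC = "dynamic"  # discovered at runtime from partitions or events

@dataclass(frozen=True)
class TaskDef:
    task_id: str   # e.g. "dbt.orders_daily" (platform-qualified, assumed convention)
    platform: str  # "airflow" | "dbt" | "spark"
    version: int   # bumped whenever the task definition changes

@dataclass(frozen=True)
class DependencyEdge:
    upstream: str                    # upstream task_id
    downstream: str                  # downstream task_id
    kind: EdgeKind
    partition: Optional[str] = None  # set only for partition-level dynamic edges

# A partition-level dynamic edge discovered at runtime:
edge = DependencyEdge("spark.raw_events", "dbt.orders_daily",
                      EdgeKind.DYNAMIC, partition="2024-06-01")
```

Keeping static and dynamic edges in one shape, distinguished by `kind` and an optional `partition`, lets lineage queries treat both uniformly while retention policies can still expire partition-level rows separately.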
Scale Requirements
- Tasks: 120K unique task definitions across environments
- Executions: 15M task runs/day, peak 4K state updates/sec
- Dependency edges: 2.5M static edges, up to 20M partition-level dynamic edges/day
- Latency: dependency resolution for a scheduling decision must complete in under 500 ms at P95
- Retention: 13 months of execution metadata, 90 days of partition-level dependency detail
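A quick back-of-envelope pass over these numbers clarifies the shape of the load (months approximated as 30 days; constants come straight from the figures above):

```python
RUNS_PER_DAY = 15_000_000
SECONDS_PER_DAY = 86_400

# Average write rate vs. the stated 4K/sec peak: the peak is a ~23x burst,
# so the ingest path must absorb spikes, not just sustain the mean.
avg_updates_per_sec = RUNS_PER_DAY / SECONDS_PER_DAY   # ~174/sec
peak_burst_factor = 4_000 / avg_updates_per_sec        # ~23x

# Rough row counts implied by the retention windows:
exec_rows_13mo = RUNS_PER_DAY * 30 * 13       # ~5.85B execution rows
dynamic_edge_rows_90d = 20_000_000 * 90       # 1.8B partition-level edge rows
```

Billions of retained rows point toward time-partitioned tables with aggressive pruning of partition-level detail after 90 days, rather than a single monolithic executions table.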
Requirements
- Build a metadata model for tasks, task versions, dependency edges, execution runs, and partition-level readiness.
- Support DAG-style dependencies, dataset-based dependencies, conditional triggers, and cross-platform dependencies between Airflow, dbt, and Spark.
- Detect cycles at definition time and prevent invalid dependency registration.
- Expose APIs for registering dependencies, updating task states, querying upstream/downstream lineage, and determining whether a task is runnable.
- Support idempotent event ingestion, replay, and backfill of historical runs without corrupting dependency state.
- Provide failure recovery, auditability, and observability for stale dependencies, missing upstreams, and out-of-order events.
- Design for multi-tenant isolation across dev, staging, and prod.
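The cycle-detection requirement can be met at registration time with a reachability check: adding edge `upstream -> downstream` creates a cycle exactly when `upstream` is already reachable from `downstream`. A minimal sketch, with class and method names that are assumptions rather than a real API:

```python
from collections import defaultdict

class DependencyRegistry:
    """Sketch of static-edge registration with cycle rejection.
    Illustrative only; names and error handling are assumptions."""

    def __init__(self):
        self.downstream = defaultdict(set)  # upstream -> set of downstream task_ids

    def _reachable(self, start, target):
        # Iterative DFS: is `target` reachable from `start`?
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.downstream[node])
        return False

    def register(self, upstream, downstream):
        # The new edge closes a cycle iff `upstream` is already
        # downstream of `downstream` (or the edge is a self-loop).
        if upstream == downstream or self._reachable(downstream, upstream):
            raise ValueError(f"edge {upstream}->{downstream} would create a cycle")
        self.downstream[upstream].add(downstream)
```

For 2.5M static edges a per-registration DFS is acceptable for interactive use; bulk imports would instead validate the whole batch with one topological sort.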
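For the idempotent-ingestion and out-of-order requirements, one common pattern is to deduplicate on a unique event id and only allow forward state transitions, so replays and late events are no-ops. A simplified sketch (the three-state model and all names are assumptions; real runs also need FAILED/retry handling, which breaks strict monotonicity and needs an attempt counter):

```python
from enum import IntEnum

class RunState(IntEnum):
    # Ordered so a replayed or out-of-order event can never move a run backwards.
    QUEUED = 1
    RUNNING = 2
    SUCCEEDED = 3

class ExecutionStore:
    """Sketch of idempotent state ingestion keyed by a unique event_id."""

    def __init__(self):
        self.run_state = {}  # run_id -> RunState
        self.seen = set()    # event_ids already applied (dedupe on replay)

    def ingest(self, event_id, run_id, state):
        if event_id in self.seen:   # exact duplicate: safe no-op
            return False
        self.seen.add(event_id)
        current = self.run_state.get(run_id)
        if current is None or state > current:  # drop out-of-order regressions
            self.run_state[run_id] = state
            return True
        return False
```

Because `ingest` is a pure upsert guarded by the dedupe set and the ordering rule, replaying a Kafka partition from an earlier offset leaves `run_state` unchanged, which is what makes backfill and recovery safe.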
Constraints
- Primary cloud is AWS; existing stack includes Airflow 2.x, Spark on EMR, dbt Core, Kafka, and PostgreSQL.
- Team size is 5 engineers; operational complexity must stay moderate.
- Compliance requires immutable audit logs for task state changes and user-initiated re-runs.
- Budget does not allow a large graph database cluster unless clearly justified.
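Given the no-graph-database constraint, lineage queries can live in the existing PostgreSQL as an adjacency table traversed with a recursive CTE. The sketch below runs the query against an in-memory SQLite stand-in so it is self-contained; the same `WITH RECURSIVE` statement works on PostgreSQL. Table and column names (`dep_edges`, `upstream`, `downstream`) are assumptions:

```python
import sqlite3

# In-memory stand-in for the PostgreSQL edges table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dep_edges (upstream TEXT, downstream TEXT)")
conn.executemany("INSERT INTO dep_edges VALUES (?, ?)", [
    ("spark.raw_events", "dbt.stg_events"),
    ("dbt.stg_events", "dbt.orders_daily"),
    ("dbt.orders_daily", "airflow.report_email"),
])

# All transitive downstream tasks of a given task. UNION (not UNION ALL)
# deduplicates visited nodes, which also terminates the recursion safely.
DOWNSTREAM_SQL = """
WITH RECURSIVE lineage(task_id) AS (
    SELECT downstream FROM dep_edges WHERE upstream = ?
    UNION
    SELECT e.downstream
    FROM dep_edges e
    JOIN lineage l ON e.upstream = l.task_id
)
SELECT task_id FROM lineage
"""

rows = {r[0] for r in conn.execute(DOWNSTREAM_SQL, ("spark.raw_events",))}
```

With a composite index on `(upstream, downstream)` and the 2.5M-static-edge scale above, this kind of traversal typically fits the sub-500 ms P95 budget without a dedicated graph store; partition-level dynamic edges would live in a separate time-partitioned table joined in only when needed.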