Context
FinPulse, a B2B payments analytics company, runs nightly ETL pipelines that ingest transaction files from S3, validate them, transform them with Spark, and load curated tables into Snowflake. The current Apache Airflow setup retries failed tasks with a fixed delay, causing duplicate loads, wasted compute, and long recovery times when failures are transient.
You need to design a retry strategy for failed pipeline tasks that improves reliability without corrupting downstream data. The solution should cover transient infrastructure failures, external API timeouts, and data-quality-related failures that should not be retried automatically.
Scale Requirements
- Pipelines: 180 scheduled DAG runs per day
- Tasks: ~2,500 task executions/day across ingestion, validation, transform, and load stages
- Input volume: 4 TB/day of Parquet and CSV files in S3
- Latency target: Critical finance datasets available by 6:00 AM UTC
- Failure rate: 3-5% transient task failure rate during peak load
- Recovery objective: Retryable failures should recover within 15 minutes without manual intervention
- Retention: Task execution logs and retry metadata retained for 90 days
Requirements
- Design task-level retry policies for different failure classes: transient, dependency, and deterministic data errors.
- Prevent duplicate writes when retries occur, especially for Snowflake loads and partitioned S3 outputs.
- Define how retry metadata, attempt counts, and terminal failure reasons are stored and exposed.
- Include orchestration logic for exponential backoff, max retry limits, and escalation to dead-letter or manual review paths.
- Describe monitoring and alerting for retry storms, SLA risk, and repeated task failure patterns.
- Explain how downstream dependencies should behave when an upstream task succeeds after retry.
Constraints
- Existing stack is AWS + Airflow + Spark + Snowflake; avoid introducing a second orchestrator.
- Incremental cloud spend must stay under $8K/month.
- Finance data is SOX-audited, so every retry decision must be traceable.
- Some source systems expose rate-limited REST APIs (429s common during month-end close).