Context
FinSight, a B2B fintech analytics company, runs scheduled ETL pipelines that ingest payment, ledger, and customer-support data from PostgreSQL, Stripe APIs, and S3 into Snowflake. The current stack is functionally correct, but operators learn about failures only when downstream analysts report them, and there is no unified view of pipeline health, data freshness, or data quality.
You need to design an observability layer for this data platform so the team can detect, triage, and recover from failures quickly across ingestion, transformation, and warehouse loading.
Scale Requirements
- Pipelines: 180 Airflow DAGs, ~1,200 task runs/hour during peak windows
- Data volume: 4 TB/day raw ingest, 25 TB compressed retained in S3 data lake
- Latency targets: Critical finance tables refreshed within 15 minutes; non-critical tables within 4 hours
- Failure budget: <0.5% failed task runs/day for tier-1 pipelines
- Team: 5 data engineers, 1 platform engineer, on-call rotation shared across teams
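As an illustration of how the tier-1 failure budget above could be evaluated, here is a minimal Python sketch that computes a daily failure rate and compares it with the 0.5% threshold. The `DailyTaskStats` type and the function names are hypothetical, and the counts are assumed to come from some source of task-run metadata (for example, Airflow's metadata database); nothing here is prescribed by the requirements.

```python
from dataclasses import dataclass

# Threshold taken from the scale requirements: "<0.5% failed task runs/day".
TIER1_MAX_FAILURE_RATE = 0.005


@dataclass(frozen=True)
class DailyTaskStats:
    """Hypothetical container for one day's tier-1 task-run counts."""
    total_runs: int
    failed_runs: int


def failure_rate(stats: DailyTaskStats) -> float:
    """Return failed runs as a fraction of all runs (0.0 when nothing ran)."""
    if stats.total_runs == 0:
        return 0.0
    return stats.failed_runs / stats.total_runs


def within_failure_budget(
    stats: DailyTaskStats, max_rate: float = TIER1_MAX_FAILURE_RATE
) -> bool:
    """True when the day's failure rate is strictly below the budget."""
    return failure_rate(stats) < max_rate


if __name__ == "__main__":
    day = DailyTaskStats(total_runs=10_000, failed_runs=40)
    print(f"rate={failure_rate(day):.4f} ok={within_failure_budget(day)}")
```

A day with no runs is treated as within budget here; a real design might instead flag it, since zero tier-1 runs is itself a likely symptom of a scheduler outage.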
Requirements
- Design end-to-end observability for batch ETL and near-real-time ingestion pipelines.
- Capture infrastructure, pipeline, and data-quality signals in one monitoring model.
- Define SLIs/SLOs for freshness, completeness, task success rate, and schema stability.
- Support root-cause analysis for failures such as upstream API outages, schema drift, duplicate loads, and slow warehouse jobs.
- Provide alerting that reduces noise and routes incidents by severity and pipeline tier.
- Include metadata and lineage so operators can identify impacted downstream tables and dashboards.
- Show how you would instrument Airflow, dbt, Spark jobs, and Snowflake loads.
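One way to make the freshness SLI and tier-based alert routing concrete is sketched below in Python. It reuses the latency targets from the scale requirements (15 minutes for critical tables, 4 hours for non-critical ones). The tier names, the `"page"` and `"ticket"` severity labels, and the function names are assumptions made for illustration, not part of the stated requirements.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Freshness targets from the scale requirements; the tier keys are hypothetical.
FRESHNESS_TARGETS = {
    "critical": timedelta(minutes=15),
    "non_critical": timedelta(hours=4),
}

# Assumed routing policy: breaches on critical tables page the on-call
# engineer, while breaches on other tables open a ticket.
SEVERITY_BY_TIER = {
    "critical": "page",
    "non_critical": "ticket",
}


def freshness_lag(last_loaded_at: datetime, now: datetime) -> timedelta:
    """Freshness SLI: time elapsed since the table was last loaded."""
    return now - last_loaded_at


def route_freshness_alert(
    tier: str, last_loaded_at: datetime, now: datetime
) -> Optional[str]:
    """Return an alert severity if the table breaches its target, else None."""
    lag = freshness_lag(last_loaded_at, now)
    if lag <= FRESHNESS_TARGETS[tier]:
        return None
    return SEVERITY_BY_TIER[tier]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    stale_since = now - timedelta(minutes=20)
    print(route_freshness_alert("critical", stale_since, now))
```

Keeping the targets and the routing policy in plain lookup tables means a new tier or severity can be added without touching the evaluation logic.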
Constraints
- Existing infrastructure is AWS-based and must remain in place: Airflow 2.x on EKS, Spark on EMR, S3, Snowflake, and dbt Core
- Incremental budget is capped at $12K/month for observability tooling
- Financial datasets require auditability and 1-year retention of operational logs
- The team prefers open standards where possible and wants to avoid building a custom monitoring platform from scratch
- PII must not be emitted into logs, traces, or alert payloads
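The PII constraint could be enforced at the logging layer, among other places. The sketch below uses Python's standard `logging.Filter` to mask email addresses and card-like digit sequences before a record reaches any handler. The two regular expressions, the placeholder strings, and the `PiiRedactionFilter` name are assumptions for illustration; a real deployment would need a reviewed pattern list and would likely combine this with controls in the collector and alerting pipeline.

```python
import logging
import re

# Illustrative patterns only; an actual PII inventory would be broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)


class PiiRedactionFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message in place."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format msg % args first so PII passed as arguments is also covered.
        record.msg = redact(record.getMessage())
        record.args = None
        return True  # keep the record; only its content is changed


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(PiiRedactionFilter())
    logger = logging.getLogger("etl.example")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("refund issued to %s", "jane@example.com")
```

Pattern-based redaction can miss unstructured PII and can also mask harmless long numeric identifiers, so it is best treated as a safety net rather than the primary control.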