Context
FinSight, a personal finance analytics company, runs daily ETL pipelines that ingest transaction, account, and customer data from PostgreSQL, Stripe, and S3 into Snowflake for internal reporting. The current Airflow-based pipeline does not complete reliably, and the data team lacks clear visibility into runtime regressions, failed tasks, stale tables, and data quality issues.
You are asked to design a monitoring strategy for this pipeline so engineers can quickly detect performance degradation, data freshness problems, and quality failures before business users see incorrect dashboards.
Scale Requirements
- Sources: 3 primary sources (PostgreSQL OLTP, Stripe API, S3 partner files)
- Volume: 250M rows/day, ~1.2 TB raw data/day
- Pipeline cadence: Daily full load for some tables, hourly incremental loads for high-value fact tables
- Latency target: Hourly tables available in Snowflake within 20 minutes of scheduled start
- SLA: 99.5% successful scheduled runs per month (see the error-budget sketch after this list)
- Retention: 180 days of raw logs and 13 months of warehouse history
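To make the SLA concrete, a rough error budget can be derived from the cadence above; the run counts in this sketch are assumptions (24 hourly incremental runs plus one daily full load, over a 30-day month).

```python
# Rough error budget implied by the 99.5% monthly SLA.
# Assumed cadence: 24 hourly incremental runs + 1 daily full-load run, 30-day month.
hourly_runs = 24 * 30                         # 720 incremental runs per month
daily_runs = 1 * 30                           # 30 full-load runs per month
scheduled_runs = hourly_runs + daily_runs     # 750 scheduled runs per month
allowed_failures = scheduled_runs * (1 - 0.995)
print(round(allowed_failures, 1))             # 3.8 -> roughly 3-4 failed runs before the SLA is breached
```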
Requirements
- Define how to monitor pipeline performance across ingestion, transformation, and load stages.
- Track task runtime, queue delays, retry counts, row counts, and end-to-end freshness.
- Detect data quality issues such as null spikes, duplicate primary keys, and schema drift (see the quality-check sketch below).
- Propose alerting thresholds and escalation paths for failures and slowdowns.
- Show how monitoring integrates with Airflow, dbt, and Snowflake (an Airflow callback sketch follows this list).
- Include a lightweight approach for historical trend analysis and SLA reporting (an example SLA query follows this list).
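As one possible integration point with Airflow, the sketch below pushes task runtime and retry counts as custom CloudWatch metrics from success/failure callbacks; the `FinSight/Pipeline` namespace and dimension names are assumptions, and row counts or freshness timestamps could be emitted the same way from load tasks.

```python
# Minimal sketch: Airflow task callbacks emitting custom CloudWatch metrics.
# The namespace and dimensions are illustrative, not an existing convention.
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_task_metrics(context):
    ti = context["task_instance"]
    dims = [
        {"Name": "dag_id", "Value": ti.dag_id},
        {"Name": "task_id", "Value": ti.task_id},
    ]
    cloudwatch.put_metric_data(
        Namespace="FinSight/Pipeline",   # hypothetical namespace
        MetricData=[
            {"MetricName": "task_duration_seconds",
             "Value": float(ti.duration or 0.0),
             "Unit": "Seconds", "Dimensions": dims},
            # try_number is used here as a rough retry count for the attempt
            {"MetricName": "task_retries",
             "Value": float(max(ti.try_number - 1, 0)),
             "Unit": "Count", "Dimensions": dims},
        ],
    )

# Applied to every task via the DAG's default_args:
default_args = {
    "on_success_callback": emit_task_metrics,
    "on_failure_callback": emit_task_metrics,
}
```

CloudWatch alarms on these metrics can then drive the alerting thresholds and escalation paths without introducing a new monitoring platform.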
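For the quality checks, a sketch of warehouse-side tests for null spikes and duplicate primary keys is below, assuming the snowflake-connector-python client, an illustrative analytics.transactions table with a load_date column, and placeholder thresholds to tune. The same null and uniqueness checks could equally be expressed as dbt's built-in not_null and unique tests.

```python
# Minimal sketch of warehouse-side quality checks against Snowflake.
# Table name, column names, and thresholds are assumptions.
import snowflake.connector

CHECKS = {
    # check name: (SQL returning a single value, alert threshold)
    "null_spike_customer_id": (
        "SELECT COUNT_IF(customer_id IS NULL) / NULLIF(COUNT(*), 0) "
        "FROM analytics.transactions WHERE load_date = CURRENT_DATE",
        0.01,   # alert if more than 1% of today's rows have a null customer_id
    ),
    "duplicate_transaction_id": (
        "SELECT COUNT(*) - COUNT(DISTINCT transaction_id) "
        "FROM analytics.transactions WHERE load_date = CURRENT_DATE",
        0,      # alert on any duplicated primary key
    ),
}

def run_checks(conn: "snowflake.connector.SnowflakeConnection") -> list[str]:
    """Return a list of human-readable failures for alerting."""
    failures = []
    cur = conn.cursor()
    try:
        for name, (sql, threshold) in CHECKS.items():
            value = cur.execute(sql).fetchone()[0] or 0
            if value > threshold:
                failures.append(f"{name}: observed {value}, threshold {threshold}")
    finally:
        cur.close()
    return failures
```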
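For trend analysis and SLA reporting, one lightweight option is a scheduled query over a run log landed in Snowflake; the ops.pipeline_run_log table and its columns in this sketch are assumptions.

```python
# Minimal sketch of a monthly SLA/trend query, assuming run outcomes are landed in
# an illustrative ops.pipeline_run_log table (dag_id, run_date, scheduled_start,
# finished_at, status). Execute with e.g. conn.cursor().execute(SLA_SQL).fetchall().
SLA_SQL = """
SELECT
    DATE_TRUNC('month', run_date)                     AS month,
    COUNT(*)                                          AS scheduled_runs,
    COUNT_IF(status = 'success')                      AS successful_runs,
    successful_runs / NULLIF(scheduled_runs, 0)       AS success_rate,
    APPROX_PERCENTILE(
        DATEDIFF('minute', scheduled_start, finished_at), 0.95
    )                                                 AS p95_minutes_to_available
FROM ops.pipeline_run_log
WHERE run_date >= DATEADD('month', -13, CURRENT_DATE)  -- matches the 13-month warehouse retention
GROUP BY 1
ORDER BY 1
"""
```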
Constraints
- Existing stack is AWS + Snowflake; avoid introducing a large new platform.
- Team size is 3 data engineers; operational overhead must stay low.
- Budget allows managed monitoring tools but not a full observability rebuild.
- PCI-related financial data requires auditability of failures, retries, and manual reruns (see the audit-record sketch below).
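To satisfy the auditability constraint, one option is an append-only audit record written from Airflow callbacks; the bucket name and key layout in this sketch are assumptions, and bucket versioning or Object Lock (configured outside this code) would provide the immutability.

```python
# Minimal sketch of an append-only audit trail for failures, retries, and manual
# reruns, written as immutable JSON objects to an illustrative S3 bucket.
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
AUDIT_BUCKET = "finsight-pipeline-audit"   # hypothetical bucket name

def write_audit_event(context, event_type: str, actor: str = "airflow"):
    ti = context["task_instance"]
    record = {
        "event_type": event_type,                       # e.g. "failure", "retry", "manual_rerun"
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "logical_date": str(context.get("logical_date")),  # "execution_date" on older Airflow
        "try_number": ti.try_number,
        "actor": actor,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    key = f"audit/{ti.dag_id}/{record['recorded_at'][:10]}/{uuid.uuid4()}.json"
    s3.put_object(Bucket=AUDIT_BUCKET, Key=key, Body=json.dumps(record).encode())

# Wired into DAG default_args, for example:
# default_args = {
#     "on_failure_callback": lambda ctx: write_audit_event(ctx, "failure"),
#     "on_retry_callback":   lambda ctx: write_audit_event(ctx, "retry"),
# }
```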