Context
Northstar Health, a digital healthcare analytics company, runs a mixed batch and streaming data platform on AWS. Data arrives from PostgreSQL transactional databases, Fivetran SaaS connectors, and Kafka topics, all feeding a Snowflake warehouse. The problem is not ingestion itself but weak observability: failed jobs, stale tables, and silent data quality regressions are often discovered by analysts rather than by the platform team.
You need to design a monitoring, alerting, and notification framework for the platform that gives engineers and data consumers clear visibility into pipeline health, freshness, and data quality.
Scale Requirements
- Pipelines: 180 scheduled Airflow DAGs and 12 streaming jobs
- Data volume: ~9 TB/day ingested into S3 and Snowflake
- Tables monitored: 1,200 warehouse tables, of which 150 are business-critical
- Latency targets: batch runs complete within 30 minutes of their scheduled time; streaming lag stays under 2 minutes
- Users: 40 internal analysts, 12 data engineers, 6 downstream ML jobs
- Retention: 13 months of operational metrics and audit logs
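The latency targets above can be read as two simple predicates. The following is a minimal sketch, assuming the batch SLA means a run must finish within 30 minutes of its scheduled time and that streaming lag is measured as a duration; the function names are hypothetical and not part of any existing tool.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the latency targets; the interpretation is an assumption.
BATCH_SLA = timedelta(minutes=30)
STREAMING_MAX_LAG = timedelta(minutes=2)


def batch_sla_breached(scheduled_at, completed_at, now):
    """Return True if a batch run finished late, or is unfinished past its deadline."""
    deadline = scheduled_at + BATCH_SLA
    if completed_at is None:
        # Run has not finished yet: breached only once the deadline has passed.
        return now > deadline
    return completed_at > deadline


def streaming_sla_breached(lag):
    """Return True if the lag is not strictly under the 2-minute target."""
    return lag >= STREAMING_MAX_LAG
```

In practice the inputs would come from Airflow run metadata and Kafka consumer lag metrics; how those are collected is left open here.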
Requirements
- Monitor orchestration health for Airflow DAGs, task retries, SLA misses, and dependency failures.
- Monitor streaming health for Kafka consumer lag, job restarts, and end-to-end latency.
- Implement data quality checks for row-count anomalies, null spikes, schema drift, duplicate keys, and freshness.
- Route alerts by severity: Slack for warnings, PagerDuty for production-critical failures, and email for daily summaries.
- Provide dashboards for engineers and business users showing pipeline status, table freshness, and incident history.
- Support alert deduplication, escalation, and suppression during planned maintenance windows.
- Store monitoring events in a queryable system for trend analysis and postmortems.
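To make the routing, deduplication, and suppression requirements concrete, here is a minimal in-memory sketch. The channel mapping follows the severity requirement above. The severity names, the 15-minute deduplication window, and the `AlertRouter` class are assumptions for illustration. Escalation is not modelled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Severity names are assumed; the channels come from the routing requirement.
ROUTES = {"warning": "slack", "critical": "pagerduty", "summary": "email"}


@dataclass
class AlertRouter:
    dedup_window: timedelta = timedelta(minutes=15)  # assumed value
    # Each window is a (start, end, pipeline) tuple; end is exclusive.
    maintenance_windows: list = field(default_factory=list)
    _last_sent: dict = field(default_factory=dict)

    def route(self, pipeline, check, severity, now):
        """Return the target channel, or None if suppressed or a duplicate."""
        # Suppress alerts for pipelines inside a planned maintenance window.
        for start, end, target in self.maintenance_windows:
            if target == pipeline and start <= now < end:
                return None
        # Drop repeats of the same alert within the deduplication window.
        key = (pipeline, check, severity)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.dedup_window:
            return None
        self._last_sent[key] = now
        return ROUTES[severity]
```

A production version would need shared state (for example a table in the monitoring event store) rather than a per-process dictionary, so that deduplication holds across workers.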
Constraints
- AWS-first environment; avoid introducing more than two new managed services.
- HIPAA-sensitive metadata: no raw PHI in logs, alerts, or dashboards.
- Small team to build and operate the framework: 3 data engineers and 1 platform engineer.
- Incremental budget cap: $12K/month.
- The existing stack must remain in place: Airflow, Kafka, dbt, Snowflake, S3, and CloudWatch.
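One way to approach the PHI constraint is to build alert and log payloads from an allowlist of metadata fields, so that row-level values never reach Slack, PagerDuty, email, or dashboards. This is a sketch under stated assumptions: the field names and the `build_alert_payload` helper are hypothetical, and an allowlist alone does not guarantee compliance (a free-text field could still carry PHI).

```python
# Hypothetical allowlist: aggregate and structural metadata only, no row values.
ALLOWED_FIELDS = {
    "pipeline", "table", "check", "severity",
    "row_count", "null_rate", "observed_at",
}


def build_alert_payload(event: dict) -> dict:
    """Keep only allowlisted metadata fields and drop everything else."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}
```

For example, an event that carries a sample of failing rows would have that sample dropped, leaving only the table name, check name, and aggregate statistics in the outgoing alert.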