Context
Northstar Health, a mid-sized healthcare analytics company, ingests daily CSV and JSON extracts from clinics, billing systems, and patient engagement apps into a Snowflake warehouse. Analysts currently discover missing values only after dashboards break or downstream models drift, so the data team needs a reliable pipeline that detects, classifies, and handles nulls before publication.
The existing stack uses Amazon S3 for landing, Apache Airflow for orchestration, Python-based validation jobs, and dbt for warehouse transformations. You need to design a production-ready batch pipeline for missing-value handling that preserves auditability and supports different treatment strategies by column.
Scale Requirements
- Sources: 18 upstream systems, daily batch loads
- Volume: 220 million records/day, ~450 GB of compressed raw data
- Table count: 35 raw tables, 12 curated marts
- Latency SLA: Curated tables available by 7:00 AM UTC; batch window is 90 minutes
- Retention: 2 years raw, 5 years curated + audit logs
Requirements
- Ingest source files from S3 and standardize schemas before quality checks (see the standardization sketch after this list).
- Detect missing values at field, row, and file level, including empty strings, malformed placeholders ("N/A", "unknown", "-"), and true SQL NULLs (a detection sketch follows this list).
- Apply configurable handling rules by column, such as dropping rows, default substitution, forward fill, median imputation, or routing to quarantine (the rule-and-audit sketch after this list covers this and the next item).
- Preserve original raw values and record every imputation decision for audit and backfill.
- Publish data quality metrics and fail the pipeline when critical columns exceed allowed null thresholds.
- Load clean data into Snowflake and build dbt models only if quality gates pass (an orchestration sketch follows this list).
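A minimal sketch of the ingest-and-standardize step, assuming pandas and boto3 in the Python validation jobs; the `CANONICAL` column contract, the function name, and the assumption that JSON files land as arrays of records are all illustrative, not part of the spec:

```python
import io
import json

import boto3
import pandas as pd

# Illustrative column contract; in practice each source system would
# carry its own versioned contract.
CANONICAL = ["patient_id", "clinic_id", "visit_date", "patient_age"]

def load_and_standardize(bucket: str, key: str) -> pd.DataFrame:
    """Read one landed CSV/JSON object and coerce it to the contract shape."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    if key.endswith(".json"):
        df = pd.DataFrame(json.loads(body))  # assumes a JSON array of records
    else:
        df = pd.read_csv(io.BytesIO(body), dtype="string")
    # Reindex to the contract: unexpected columns are dropped, missing
    # ones appear as nulls, so downstream checks see a uniform shape.
    return df.reindex(columns=CANONICAL)
```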
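For detection, one workable approach is to normalize placeholder tokens to true nulls first, then measure missingness at all three levels. A sketch assuming pandas; the token list is a hypothetical starting point, not an exhaustive one:

```python
import pandas as pd

# Placeholder tokens treated as missing; extend per source system.
MISSING_TOKENS = {"", "n/a", "na", "unknown", "-", "null", "none"}

def normalize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Map empty strings and placeholder tokens to true nulls."""
    def clean(value):
        if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
            return pd.NA
        return value
    return df.apply(lambda col: col.map(clean))

def missingness_report(df: pd.DataFrame) -> dict:
    """Summarize missing values at field, row, and file level."""
    return {
        "field_nulls": df.isna().sum().to_dict(),             # per column
        "rows_with_nulls": int(df.isna().any(axis=1).sum()),  # per row
        "file_null_rate": float(df.isna().to_numpy().mean()), # whole file
    }
```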
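The per-column rules and the audit trail fit naturally into one pass: apply each column's strategy, and emit an audit record for every decision so it can be replayed or reversed on backfill. A sketch covering three of the five strategies; the `RULES` table, strategy names, and record shape are illustrative, and real rules would live in version-controlled config:

```python
from datetime import datetime, timezone

import pandas as pd

# Illustrative rule config; real rules would be versioned YAML keyed
# by table and column.
RULES = {
    "patient_age":  {"strategy": "median"},   # assumes a numeric column
    "visit_reason": {"strategy": "default", "value": "unspecified"},
    "clinic_id":    {"strategy": "quarantine"},  # critical join key
}

def apply_rules(df: pd.DataFrame, rules: dict):
    """Return (clean_df, quarantine_df, audit_records)."""
    audit = []
    quarantine = pd.Series(False, index=df.index)
    for col, rule in rules.items():
        nulls = df[col].isna()
        if not nulls.any():
            continue
        if rule["strategy"] == "median":
            fill = df[col].median()
            df.loc[nulls, col] = fill
        elif rule["strategy"] == "default":
            fill = rule["value"]
            df.loc[nulls, col] = fill
        elif rule["strategy"] == "quarantine":
            fill = None
            quarantine |= nulls
        # One record per affected row; raw files stay untouched in S3,
        # so original values plus these records allow full replay.
        audit.extend(
            {"row_id": i, "column": col, "strategy": rule["strategy"],
             "filled_with": fill,
             "decided_at": datetime.now(timezone.utc).isoformat()}
            for i in df.index[nulls]
        )
    return df[~quarantine], df[quarantine], audit
```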
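The quality gate and the conditional dbt build map directly onto Airflow task dependencies: a gate task fails the run when a critical column breaches its threshold, and the dbt task only runs downstream of a successful gate. A sketch assuming Airflow 2.x TaskFlow; the thresholds, the hardcoded null rates, and the schedule are placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException
from airflow.operators.bash import BashOperator

# Illustrative thresholds; real values belong in per-column config.
CRITICAL_NULL_THRESHOLDS = {"patient_id": 0.0, "clinic_id": 0.001}

@dag(schedule="30 4 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def quality_gated_load():
    @task
    def quality_gate():
        # In production these rates would be read from the metrics the
        # validation jobs publish; hardcoded to keep the sketch
        # self-contained.
        null_rates = {"patient_id": 0.0, "clinic_id": 0.0004}
        for col, limit in CRITICAL_NULL_THRESHOLDS.items():
            if null_rates[col] > limit:
                raise AirflowFailException(
                    f"{col}: null rate {null_rates[col]:.4%} > limit {limit:.4%}"
                )

    # dbt builds only run downstream of a green gate.
    dbt_build = BashOperator(task_id="dbt_build", bash_command="dbt build")
    quality_gate() >> dbt_build

quality_gated_load()
```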
Constraints
- AWS-first environment; no new always-on infrastructure
- HIPAA-sensitive data; PHI must remain encrypted at rest and in transit
- Small team: 3 data engineers, 1 analytics engineer
- Monthly incremental spend target: under $12K
- Pipeline must support idempotent reruns and selective backfills by date/source (see the rerun sketch below)
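One common way to meet the last constraint is slice-overwrite semantics: each run deletes and reloads exactly its (load_date, source) slice inside a transaction, so a rerun replaces data rather than duplicating it. A sketch assuming the snowflake-connector-python client; the table, stage name, and column names are hypothetical:

```python
import snowflake.connector  # conn below comes from snowflake.connector.connect(...)

def reload_slice(conn, table: str, load_date: str, source: str, prefix: str):
    """Delete-then-reload one (load_date, source) slice; safe to rerun."""
    with conn.cursor() as cur:
        cur.execute("BEGIN")
        cur.execute(
            f"DELETE FROM {table} WHERE load_date = %s AND source_system = %s",
            (load_date, source),
        )
        cur.execute(
            f"COPY INTO {table} FROM @clean_stage/{prefix} "
            "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
        )
        cur.execute("COMMIT")
```

Under this pattern a selective backfill is just a loop over the requested dates and sources calling the same function, which also keeps backfill code identical to the daily path.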