Context
Northstar Health, a mid-sized healthcare analytics company, ingests daily CSV and JSON extracts from clinics, billing systems, and patient engagement apps into a Snowflake warehouse. Analysts currently discover missing values only after dashboards break or downstream models drift, so the data team needs a reliable pipeline that detects, classifies, and handles nulls before publication.
The existing stack uses Amazon S3 for landing, Apache Airflow for orchestration, Python-based validation jobs, and dbt for warehouse transformations. You need to design a production-ready batch pipeline for missing-value handling that preserves auditability and supports different treatment strategies by column.
Scale Requirements
- Sources: 18 upstream systems, daily batch loads
- Volume: 220 million records/day, ~450 GB of compressed raw data
- Table count: 35 raw tables, 12 curated marts
- Latency SLA: Curated tables available by 7:00 AM UTC; batch window is 90 minutes
- Retention: 2 years raw, 5 years curated + audit logs
Requirements
- Ingest source files from S3 and standardize schemas before quality checks (see the standardization sketch after this list).
- Detect missing values at field, row, and file level, including empty strings, malformed placeholders ("N/A", "unknown", "-"), and true SQL NULLs (a detection sketch follows this list).
- Apply configurable handling rules by column, such as dropping rows, default substitution, forward fill, median imputation, or routing to quarantine (the rule-and-audit sketch after this list covers this and the next item).
- Preserve original raw values and record every imputation decision for audit and backfill.
- Publish data quality metrics and fail the pipeline when critical columns exceed allowed null thresholds.
- Load clean data into Snowflake and build dbt models only if quality gates pass (an orchestration sketch follows this list).
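A minimal sketch of the ingest-and-standardize step, assuming pandas and boto3 in the Python validation jobs; the `CANONICAL` column contract, the function name, and the assumption that JSON files land as arrays of records are all illustrative, not part of the spec:

```python
import io
import json

import boto3
import pandas as pd

# Illustrative column contract; in practice each source system would
# carry its own versioned contract.
CANONICAL = ["patient_id", "clinic_id", "visit_date", "patient_age"]

def load_and_standardize(bucket: str, key: str) -> pd.DataFrame:
    """Read one landed CSV/JSON object and coerce it to the contract shape."""
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    if key.endswith(".json"):
        df = pd.DataFrame(json.loads(body))  # assumes a JSON array of records
    else:
        df = pd.read_csv(io.BytesIO(body), dtype="string")
    # Reindex to the contract: unexpected columns are dropped, missing
    # ones appear as nulls, so downstream checks see a uniform shape.
    return df.reindex(columns=CANONICAL)
```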
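For detection, one workable approach is to normalize placeholder tokens to true nulls first, then measure missingness at all three levels. A sketch assuming pandas; the token list is a hypothetical starting point, not an exhaustive one:

```python
import pandas as pd

# Placeholder tokens treated as missing; extend per source system.
MISSING_TOKENS = {"", "n/a", "na", "unknown", "-", "null", "none"}

def normalize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Map empty strings and placeholder tokens to true nulls."""
    def clean(value):
        if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
            return pd.NA
        return value
    return df.apply(lambda col: col.map(clean))

def missingness_report(df: pd.DataFrame) -> dict:
    """Summarize missing values at field, row, and file level."""
    return {
        "field_nulls": df.isna().sum().to_dict(),             # per column
        "rows_with_nulls": int(df.isna().any(axis=1).sum()),  # per row
        "file_null_rate": float(df.isna().to_numpy().mean()), # whole file
    }
```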
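The per-column rules and the audit trail fit naturally into one pass: apply each column's strategy, and emit an audit record for every decision so it can be replayed or reversed on backfill. A sketch covering three of the five strategies; the `RULES` table, strategy names, and record shape are illustrative, and real rules would live in version-controlled config:

```python
from datetime import datetime, timezone

import pandas as pd

# Illustrative rule config; real rules would be versioned YAML keyed
# by table and column.
RULES = {
    "patient_age":  {"strategy": "median"},   # assumes a numeric column
    "visit_reason": {"strategy": "default", "value": "unspecified"},
    "clinic_id":    {"strategy": "quarantine"},  # critical join key
}

def apply_rules(df: pd.DataFrame, rules: dict):
    """Return (clean_df, quarantine_df, audit_records)."""
    audit = []
    quarantine = pd.Series(False, index=df.index)
    for col, rule in rules.items():
        nulls = df[col].isna()
        if not nulls.any():
            continue
        if rule["strategy"] == "median":
            fill = df[col].median()
            df.loc[nulls, col] = fill
        elif rule["strategy"] == "default":
            fill = rule["value"]
            df.loc[nulls, col] = fill
        elif rule["strategy"] == "quarantine":
            fill = None
            quarantine |= nulls
        # One record per affected row; raw files stay untouched in S3,
        # so original values plus these records allow full replay.
        audit.extend(
            {"row_id": i, "column": col, "strategy": rule["strategy"],
             "filled_with": fill,
             "decided_at": datetime.now(timezone.utc).isoformat()}
            for i in df.index[nulls]
        )
    return df[~quarantine], df[quarantine], audit
```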
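The quality gate and the conditional dbt build map directly onto Airflow task dependencies: a gate task fails the run when a critical column breaches its threshold, and the dbt task only runs downstream of a successful gate. A sketch assuming Airflow 2.x TaskFlow; the thresholds, the hardcoded null rates, and the schedule are placeholders:

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowFailException
from airflow.operators.bash import BashOperator

# Illustrative thresholds; real values belong in per-column config.
CRITICAL_NULL_THRESHOLDS = {"patient_id": 0.0, "clinic_id": 0.001}

@dag(schedule="30 4 * * *", start_date=datetime(2024, 1, 1), catchup=False)
def quality_gated_load():
    @task
    def quality_gate():
        # In production these rates would be read from the metrics the
        # validation jobs publish; hardcoded to keep the sketch
        # self-contained.
        null_rates = {"patient_id": 0.0, "clinic_id": 0.0004}
        for col, limit in CRITICAL_NULL_THRESHOLDS.items():
            if null_rates[col] > limit:
                raise AirflowFailException(
                    f"{col}: null rate {null_rates[col]:.4%} > limit {limit:.4%}"
                )

    # dbt builds only run downstream of a green gate.
    dbt_build = BashOperator(task_id="dbt_build", bash_command="dbt build")
    quality_gate() >> dbt_build

quality_gated_load()
```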
Constraints
- AWS-first environment; no new always-on infrastructure
- HIPAA-sensitive data; PHI must remain encrypted at rest and in transit
- Small team: 3 data engineers, 1 analytics engineer
- Monthly incremental spend target: under $12K
- Pipeline must support idempotent reruns and selective backfills by date/source (see the rerun sketch below)
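One common way to meet the last constraint is slice-overwrite semantics: each run deletes and reloads exactly its (load_date, source) slice inside a transaction, so a rerun replaces data rather than duplicating it. A sketch assuming the snowflake-connector-python client; the table, stage name, and column names are hypothetical:

```python
import snowflake.connector  # conn below comes from snowflake.connector.connect(...)

def reload_slice(conn, table: str, load_date: str, source: str, prefix: str):
    """Delete-then-reload one (load_date, source) slice; safe to rerun."""
    with conn.cursor() as cur:
        cur.execute("BEGIN")
        cur.execute(
            f"DELETE FROM {table} WHERE load_date = %s AND source_system = %s",
            (load_date, source),
        )
        cur.execute(
            f"COPY INTO {table} FROM @clean_stage/{prefix} "
            "FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
        )
        cur.execute("COMMIT")
```

Under this pattern a selective backfill is just a loop over the requested dates and sources calling the same function, which also keeps backfill code identical to the daily path.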