
In Capital One analytics workflows, especially when working with transaction, customer, or credit performance data in Snowflake- or PostgreSQL-backed reporting layers, missing values can materially change downstream metrics. Interviewers ask this to assess whether you can distinguish between data cleaning, business logic, and metric integrity.
Explain how you would handle a dataset with missing values using SQL. Your answer should cover:
NULLs, blanks, placeholder values)COALESCE and CASE WHEN help in analysisKeep the discussion practical and SQL-focused. The interviewer is not looking for advanced statistical imputation; they want a clear framework for profiling missingness, choosing an appropriate treatment based on business meaning, and implementing that logic safely in PostgreSQL.
Before changing data, quantify how much is missing and in which columns. In SQL, this usually means counting NULLs, blank strings, and known placeholder values so you understand whether the issue is isolated or systemic.
SELECT
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE income IS NULL) AS null_income_rows,
COUNT(*) FILTER (WHERE TRIM(employment_status) = '') AS blank_status_rows
FROM customer_applications;
Not all missing values are stored as NULL. Some datasets use empty strings, zeros, or codes like 'Unknown' or 'N/A', and those should often be standardized before analysis.
SELECT
CASE
WHEN employer_name IS NULL THEN 'NULL'
WHEN TRIM(employer_name) = '' THEN 'blank'
WHEN employer_name IN ('Unknown', 'N/A') THEN 'placeholder'
ELSE 'valid'
END AS employer_name_status,
COUNT(*)
FROM customer_applications
GROUP BY employer_name_status;
You should only fill missing values when there is a defensible business rule. If a field is required for a metric, excluding incomplete rows may be more accurate than defaulting to zero and biasing the result.
SELECT AVG(monthly_income)
FROM customer_applications
WHERE monthly_income IS NOT NULL;
COALESCE is useful for replacing NULL values at query time, but the replacement must reflect business meaning. CASE WHEN is better when different missing-value patterns need different treatment or labeling.
SELECT
application_id,
COALESCE(reported_bonus, 0) AS reported_bonus,
CASE
WHEN debt_to_income_ratio IS NULL THEN 'missing'
WHEN debt_to_income_ratio > 0.4 THEN 'high'
ELSE 'normal'
END AS dti_bucket
FROM customer_applications;
Missing values affect metrics differently depending on the function. AVG(column) ignores NULLs, while replacing NULL with 0 changes the denominator and can materially understate or overstate a financial KPI.
SELECT
AVG(monthly_income) AS avg_excluding_nulls,
AVG(COALESCE(monthly_income, 0)) AS avg_treating_missing_as_zero
FROM customer_applications;