Analytics datasets often contain missing values, blank strings, duplicate records, and inconsistent categories. If these issues are not handled carefully, reports and downstream metrics become unreliable.
Explain how you would handle NULL values and dirty data when preparing a dataset for analytics in SQL. Your answer should cover:
NULLs from other dirty values such as empty strings, placeholder text, or invalid numbersCOALESCE, NULLIF, CASE WHEN, and aggregates behave with missing dataKeep the discussion practical and SQL-focused. The interviewer expects you to explain both the reasoning and the trade-offs, not just list functions. Use short PostgreSQL examples where helpful.
Missing data is not always stored as SQL NULL. In real tables, you may also see empty strings, whitespace, placeholder values like 'unknown', or invalid numeric values such as -1. A good first step is to normalize these representations so analytics logic treats them consistently.
SELECT NULLIF(TRIM(email), '') AS cleaned_email
FROM customers;
Cleaning should happen before calculating metrics. If categories like 'NY', 'New York', and 'new york ' are not standardized first, grouped results will be split incorrectly and totals will be misleading.
SELECT CASE
WHEN LOWER(TRIM(state)) IN ('ny', 'new york') THEN 'New York'
ELSE state
END AS standardized_state
FROM customers;
COALESCE replaces NULL with a fallback value, but it should only be used when the replacement is analytically valid. Replacing missing revenue with 0 may be appropriate in some reporting contexts, but replacing missing age with 0 usually distorts the data.
SELECT COALESCE(discount_amount, 0) AS discount_amount
FROM orders;
Most SQL aggregates ignore NULL values. For example, AVG(score) excludes null rows from both the numerator and denominator, while COUNT(*) counts all rows and COUNT(score) counts only non-null scores. Candidates should understand how this changes reported metrics.
SELECT COUNT(*) AS total_rows,
COUNT(score) AS non_null_scores,
AVG(score) AS avg_score
FROM survey_responses;
There is no single correct cleaning rule for all columns. The right approach depends on business meaning: some records should be excluded, some standardized, and some flagged for downstream review. Strong answers connect technical choices to metric accuracy and stakeholder expectations.
SELECT CASE
WHEN amount < 0 THEN NULL
ELSE amount
END AS cleaned_amount
FROM payments;