



BAt companies like BrightCart, demographic fields such as age group, gender, and region are often used in reporting and segmentation. If 30% of critical demographic data is missing, analysts need a practical SQL-based approach that preserves data quality without introducing misleading assumptions.
Explain how you would handle a dataset where 30% of the critical demographic data is missing. In your answer, discuss:
Unknown categoryCASE WHEN, COALESCE, and basic aggregations can help with reportingThe interviewer expects a practical, SQL-oriented explanation rather than advanced statistical imputation. Focus on simple, defensible approaches that are appropriate for reporting and exploratory analysis in PostgreSQL.
The first step is to quantify how much data is missing in each critical column. In SQL, this is usually done with COUNT(*), conditional aggregation, and percentage calculations so you can see whether the issue is isolated or widespread.
SELECT
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE gender IS NULL) AS missing_gender,
ROUND(100.0 * COUNT(*) FILTER (WHERE gender IS NULL) / COUNT(*), 2) AS pct_missing_gender
FROM customers;
If the missing field is needed for grouped reporting, replacing NULL with an explicit Unknown bucket often preserves row counts and makes data quality visible. This is usually better than silently excluding a large share of records from dashboards.
SELECT
COALESCE(region, 'Unknown') AS region_group,
COUNT(*) AS customer_count
FROM customers
GROUP BY COALESCE(region, 'Unknown');
Rows should only be excluded when the analysis requires a known demographic value and using unknowns would invalidate the result. For example, a gender-specific breakdown may reasonably filter to rows where gender is present, but the analyst should report the excluded percentage.
SELECT
gender,
COUNT(*) AS customer_count
FROM customers
WHERE gender IS NOT NULL
GROUP BY gender;
A CASE WHEN expression is useful when missing values need more nuanced treatment, such as separating blank strings from true NULLs or flagging records for review. This helps standardize messy source data before aggregation.
SELECT
CASE
WHEN age_group IS NULL THEN 'Missing'
WHEN age_group = '' THEN 'Blank'
ELSE age_group
END AS age_group_status,
COUNT(*)
FROM customers
GROUP BY CASE
WHEN age_group IS NULL THEN 'Missing'
WHEN age_group = '' THEN 'Blank'
ELSE age_group
END;
Missing demographic data is not just a formatting issue; it can bias segment-level metrics if the missing rows are systematically different from complete rows. A good SQL answer should mention that completeness should be monitored alongside business metrics.
SELECT
CASE WHEN region IS NULL THEN 'Missing region' ELSE 'Known region' END AS completeness_group,
AVG(spend) AS avg_spend,
COUNT(*) AS row_count
FROM customers
GROUP BY CASE WHEN region IS NULL THEN 'Missing region' ELSE 'Known region' END;