Handling Missing Demographic Data

Easy

SQL & Data Manipulation

Asked at 15 companies

Subqueries, Case When, Data Wrangling

Problem

Context

At companies like BrightCart, demographic fields such as age group, gender, and region are often used in reporting and segmentation. If 30% of critical demographic data is missing, analysts need a practical SQL-based approach that preserves data quality without introducing misleading assumptions.

Question

Explain how you would handle a dataset where 30% of the critical demographic data is missing. In your answer, discuss:

How you would measure and profile the missingness in SQL
When to exclude rows versus keep them with an Unknown category
How CASE WHEN, COALESCE, and basic aggregations can help with reporting
What risks missing demographic data creates for downstream analysis

Scope guidance

The interviewer expects a practical, SQL-oriented explanation rather than advanced statistical imputation. Focus on simple, defensible approaches that are appropriate for reporting and exploratory analysis in PostgreSQL.

Key Concepts

Profiling missingness

The first step is to quantify how much data is missing in each critical column. In SQL, this is usually done with COUNT(*), conditional aggregation, and percentage calculations so you can see whether the issue is isolated or widespread.

SELECT
  COUNT(*) AS total_rows,
  COUNT(*) FILTER (WHERE gender IS NULL) AS missing_gender,
  ROUND(100.0 * COUNT(*) FILTER (WHERE gender IS NULL) / COUNT(*), 2) AS pct_missing_gender
FROM customers;
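
The same pattern extends to several columns at once. The sketch below is illustrative only: the inline rows are hypothetical sample data standing in for the customers table, and treating blank or whitespace-only strings as missing is an assumption that should be checked against how the source system records empty values.

```sql
-- Hypothetical sample rows standing in for the customers table.
WITH customers (customer_id, gender, region, age_group) AS (
  VALUES
    (1, 'F',  'West', '25-34'),
    (2, NULL, 'East', ''),
    (3, 'M',  NULL,   '35-44'),
    (4, NULL, NULL,   NULL)
)
SELECT
  COUNT(*) AS total_rows,
  -- NULLIF(TRIM(col), '') turns blank strings into NULL so they count as missing.
  ROUND(100.0 * COUNT(*) FILTER (WHERE NULLIF(TRIM(gender), '') IS NULL)
        / NULLIF(COUNT(*), 0), 2) AS pct_missing_gender,
  ROUND(100.0 * COUNT(*) FILTER (WHERE NULLIF(TRIM(region), '') IS NULL)
        / NULLIF(COUNT(*), 0), 2) AS pct_missing_region,
  ROUND(100.0 * COUNT(*) FILTER (WHERE NULLIF(TRIM(age_group), '') IS NULL)
        / NULLIF(COUNT(*), 0), 2) AS pct_missing_age_group,
  -- Rows where every demographic field is missing.
  COUNT(*) FILTER (
    WHERE COALESCE(NULLIF(TRIM(gender), ''),
                   NULLIF(TRIM(region), ''),
                   NULLIF(TRIM(age_group), '')) IS NULL
  ) AS rows_missing_all
FROM customers;
```

Dividing by NULLIF(COUNT(*), 0) returns NULL instead of raising a division-by-zero error when the table is empty.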

Unknown category instead of dropping rows

If the missing field is needed for grouped reporting, replacing NULL with an explicit Unknown bucket preserves row counts and makes the data quality gap visible. This is usually better than silently excluding a large share of records from dashboards.

SELECT
  COALESCE(region, 'Unknown') AS region_group,
  COUNT(*) AS customer_count
FROM customers
GROUP BY COALESCE(region, 'Unknown');
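
To make the size of the Unknown bucket visible in a report, each group's share of the total can be shown next to its count. This is a minimal sketch over hypothetical inline rows, using a window function over the aggregate to get the overall total.

```sql
-- Hypothetical sample rows standing in for the customers table.
WITH customers (customer_id, region) AS (
  VALUES (1, 'West'), (2, 'East'), (3, NULL), (4, 'West'), (5, NULL)
)
SELECT
  COALESCE(region, 'Unknown') AS region_group,
  COUNT(*) AS customer_count,
  -- SUM(COUNT(*)) OVER () is the total row count across all groups.
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_customers
FROM customers
GROUP BY COALESCE(region, 'Unknown')
ORDER BY customer_count DESC;
```

If the source data may already contain a literal 'Unknown' value, a different label keeps true NULLs from being merged with it.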

Selective exclusion

Rows should be excluded only when the analysis requires a known demographic value and including unknowns would invalidate the result. For example, a gender-specific breakdown may reasonably filter to rows where gender is present, but the analyst should report the percentage of rows excluded.

SELECT
  gender,
  COUNT(*) AS customer_count
FROM customers
WHERE gender IS NOT NULL
GROUP BY gender;
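
The excluded share can be reported with a companion query. The sketch below uses hypothetical inline rows and relies on COUNT(gender) counting only non-NULL values.

```sql
-- Hypothetical sample rows standing in for the customers table.
WITH customers (customer_id, gender) AS (
  VALUES (1, 'F'), (2, 'M'), (3, NULL), (4, 'F'), (5, NULL)
)
SELECT
  COUNT(*) AS total_rows,
  COUNT(gender) AS rows_included,              -- non-NULL gender only
  COUNT(*) - COUNT(gender) AS rows_excluded,   -- rows dropped by the filter
  ROUND(100.0 * (COUNT(*) - COUNT(gender)) / NULLIF(COUNT(*), 0), 2) AS pct_excluded
FROM customers;
```

Publishing pct_excluded alongside the filtered breakdown lets readers judge how much of the population the result actually covers.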

Conditional labeling with CASE

A CASE WHEN expression is useful when missing values need more nuanced treatment, such as separating blank strings from true NULLs or flagging records for review. This helps standardize messy source data before aggregation.

SELECT
  CASE
    WHEN age_group IS NULL THEN 'Missing'
    WHEN age_group = '' THEN 'Blank'
    ELSE age_group
  END AS age_group_status,
  COUNT(*) AS customer_count
FROM customers
GROUP BY CASE
  WHEN age_group IS NULL THEN 'Missing'
  WHEN age_group = '' THEN 'Blank'
  ELSE age_group
END;
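
When the report does not need to distinguish blanks from NULLs, the two can be collapsed into a single bucket. This sketch swaps the CASE expression for COALESCE with NULLIF and TRIM; the inline rows are hypothetical, and treating whitespace-only values as missing is an assumption. Note that it groups by the cleaned expression, so grouping the raw column would still keep ' 25-34 ' apart from '25-34'.

```sql
-- Hypothetical sample rows, including blank, whitespace-only, and padded values.
WITH customers (customer_id, age_group) AS (
  VALUES (1, '25-34'), (2, ''), (3, '   '), (4, NULL), (5, ' 25-34 ')
)
SELECT
  -- TRIM strips padding, NULLIF turns '' into NULL, COALESCE labels it.
  COALESCE(NULLIF(TRIM(age_group), ''), 'Unknown') AS age_group_clean,
  COUNT(*) AS customer_count
FROM customers
GROUP BY 1
ORDER BY customer_count DESC;
```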

Bias and downstream impact

Missing demographic data is not just a formatting issue; it can bias segment-level metrics if the rows with missing values differ systematically from complete rows. A good SQL answer notes that completeness should be monitored alongside business metrics.

SELECT
  CASE WHEN region IS NULL THEN 'Missing region' ELSE 'Known region' END AS completeness_group,
  AVG(spend) AS avg_spend,
  COUNT(*) AS row_count
FROM customers
GROUP BY CASE WHEN region IS NULL THEN 'Missing region' ELSE 'Known region' END;
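
One way to monitor completeness over time is to track the missing rate per period. The sketch below assumes a created_at date column, which is hypothetical and not part of the table described above; the inline rows are sample data only.

```sql
-- Hypothetical sample rows; created_at is an assumed column.
WITH customers (customer_id, region, created_at) AS (
  VALUES
    (1, 'West', DATE '2024-01-05'),
    (2, NULL,   DATE '2024-01-20'),
    (3, 'East', DATE '2024-02-03'),
    (4, NULL,   DATE '2024-02-14'),
    (5, NULL,   DATE '2024-02-27')
)
SELECT
  DATE_TRUNC('month', created_at)::date AS signup_month,
  COUNT(*) AS row_count,
  ROUND(100.0 * COUNT(*) FILTER (WHERE region IS NULL) / COUNT(*), 2) AS pct_missing_region
FROM customers
GROUP BY 1
ORDER BY 1;
```

A missing rate that jumps in a particular month suggests a change in data collection rather than random gaps, which is the situation most likely to bias segment-level metrics.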

Handling Missing Demographic Data | Dataford Interview Questions - Dataford - Ace your Interview