
A
A
Teams often build weekly performance reports from operational tables that contain nulls, blank strings, duplicate rows, and inconsistent status values. A good analyst or data engineer should be able to explain how they would make the report reliable before calculating metrics.
How would you handle missing or messy data when preparing a weekly performance report in SQL? In your answer, explain:
The interviewer is looking for a practical SQL-focused explanation, not just general data cleaning advice. Discuss the logic you would apply in PostgreSQL, the trade-offs of imputing versus excluding records, and how your choices affect weekly aggregates such as counts, sums, and averages.
Before writing the final report query, inspect the source data for nulls, blank strings, unexpected categories, out-of-range values, and duplicate records. This helps you decide whether to exclude, standardize, or default problematic values instead of letting bad data silently distort weekly metrics.
SELECT
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE metric_value IS NULL) AS null_metric_rows,
COUNT(*) FILTER (WHERE TRIM(COALESCE(status, '')) = '') AS blank_status_rows
FROM weekly_events;
Messy text fields often contain inconsistent casing, extra spaces, or multiple labels for the same meaning. Using functions like TRIM, LOWER, UPPER, and CASE ensures that values are grouped consistently before aggregation.
SELECT
CASE
WHEN LOWER(TRIM(status)) IN ('complete', 'completed', 'done') THEN 'completed'
WHEN LOWER(TRIM(status)) IN ('pending', 'in progress') THEN 'pending'
ELSE 'unknown'
END AS clean_status
FROM weekly_events;
Missing numeric values should not always be replaced with zero because zero means a real measurement, while NULL means unknown. Use COALESCE only when the business meaning is clear, and use CASE to exclude negative or invalid values from sums and averages when they represent bad data.
SELECT
AVG(CASE WHEN metric_value >= 0 THEN metric_value END) AS avg_valid_metric,
SUM(COALESCE(tasks_completed, 0)) AS total_tasks
FROM weekly_events;
Duplicate rows can inflate weekly counts and totals. Deduplication should be based on a business key, such as employee_id plus report_date plus task_id, rather than removing rows blindly with DISTINCT on all columns.
SELECT DISTINCT employee_id, report_date, task_id, metric_value
FROM weekly_events;
A strong weekly report clearly documents how excluded or cleaned records were handled. Including counts of dropped, defaulted, or standardized rows makes the report more trustworthy and easier to debug when numbers change week to week.
SELECT
COUNT(*) AS total_rows,
COUNT(*) FILTER (WHERE metric_value < 0) AS invalid_rows,
COUNT(*) FILTER (WHERE metric_value IS NULL) AS missing_rows
FROM weekly_events;