

Data quality issues often show up as duplicate rows and missing values. In analytics and operational systems, both can distort counts, aggregations, and downstream reporting if they are not handled carefully.
Explain how you would handle duplicate records and NULL values in a dataset using SQL. Your answer should cover:
The interviewer expects a practical explanation, not just definitions. Discuss the trade-offs between deleting data, deduplicating in queries, and preventing bad data at ingestion. Use simple SQL examples and mention common mistakes, especially around COUNT, GROUP BY, and comparisons with NULL.
Duplicates are usually found by grouping on the business key rather than the physical primary key. If a combination of columns appears more than once, it may indicate repeated records that should be reviewed or consolidated.
SELECT customer_email, order_date, COUNT(*) AS duplicate_count
FROM orders
GROUP BY customer_email, order_date
HAVING COUNT(*) > 1;
When duplicates exist, you often need a rule to decide which row to keep, such as the latest timestamp or the lowest ID. The key idea is that deduplication is not only about removing extras, but also about preserving the correct version of the data.
SELECT DISTINCT customer_email, order_date
FROM orders;
NULL means missing or unknown, not an empty string or zero. Comparisons like = NULL do not work in SQL; you must use IS NULL or IS NOT NULL to test for missing values.
SELECT *
FROM customers
WHERE phone_number IS NULL;
Sometimes missing values should be replaced for reporting or display purposes. In PostgreSQL, COALESCE is commonly used to substitute a fallback value without changing the underlying stored data.
SELECT customer_name, COALESCE(city, 'Unknown') AS city
FROM customers;
Aggregate functions treat NULLs differently: COUNT(column) ignores NULLs, while COUNT(*) counts all rows. This distinction is important when measuring completeness or summarizing data accurately.
SELECT COUNT(*) AS total_rows,
COUNT(discount_code) AS rows_with_discount
FROM orders;