Handling Duplicates and Nulls in SQL

Question

Explain how you would handle duplicate records and NULL values in a dataset using SQL. Your answer should cover:

How to identify duplicates

How to remove or retain the correct record when duplicates exist

How NULL values affect filtering, grouping, and aggregations

Common SQL techniques for replacing, excluding, or preserving NULLs depending on the use case

Identifying duplicate records

Duplicates are usually found by grouping on the business key rather than the physical primary key. If a combination of columns appears more than once, it may indicate repeated records that should be reviewed or consolidated.

SELECT customer_email, order_date, COUNT(*) AS duplicate_count
FROM orders
GROUP BY customer_email, order_date
HAVING COUNT(*) > 1;

Deduplicating while keeping one record

When duplicates exist, you often need a rule to decide which row to keep, such as the latest timestamp or the lowest ID. The key idea is that deduplication is not only about removing extras, but also about preserving the correct version of the data.

SELECT DISTINCT customer_email, order_date
FROM orders;

NULL behavior in SQL

NULL means missing or unknown, not an empty string or zero. Comparisons like = NULL do not work in SQL; you must use IS NULL or IS NOT NULL to test for missing values.

SELECT *
FROM customers
WHERE phone_number IS NULL;

Replacing NULLs safely

Sometimes missing values should be replaced for reporting or display purposes. In PostgreSQL, COALESCE is commonly used to substitute a fallback value without changing the underlying stored data.

SELECT customer_name, COALESCE(city, 'Unknown') AS city
FROM customers;

Impact on aggregations

Aggregate functions treat NULLs differently: COUNT(column) ignores NULLs, while COUNT(*) counts all rows. This distinction is important when measuring completeness or summarizing data accurately.

SELECT COUNT(*) AS total_rows,
       COUNT(discount_code) AS rows_with_discount
FROM orders;

Question

Explain how you would handle duplicate records and NULL values in a dataset using SQL. Your answer should cover:

How to identify duplicates

How to remove or retain the correct record when duplicates exist

How NULL values affect filtering, grouping, and aggregations

Common SQL techniques for replacing, excluding, or preserving NULLs depending on the use case

Identifying duplicate records

SELECT customer_email, order_date, COUNT(*) AS duplicate_count
FROM orders
GROUP BY customer_email, order_date
HAVING COUNT(*) > 1;

Deduplicating while keeping one record

SELECT DISTINCT customer_email, order_date
FROM orders;

NULL behavior in SQL

NULL means missing or unknown, not an empty string or zero. Comparisons like = NULL do not work in SQL; you must use IS NULL or IS NOT NULL to test for missing values.

SELECT *
FROM customers
WHERE phone_number IS NULL;

Replacing NULLs safely

Sometimes missing values should be replaced for reporting or display purposes. In PostgreSQL, COALESCE is commonly used to substitute a fallback value without changing the underlying stored data.

SELECT customer_name, COALESCE(city, 'Unknown') AS city
FROM customers;

Impact on aggregations

Aggregate functions treat NULLs differently: COUNT(column) ignores NULLs, while COUNT(*) counts all rows. This distinction is important when measuring completeness or summarizing data accurately.

SELECT COUNT(*) AS total_rows,
       COUNT(discount_code) AS rows_with_discount
FROM orders;

Problem

Context

Question

Scope Guidance

Key Concepts

Identifying duplicate records

Deduplicating while keeping one record

NULL behavior in SQL

Replacing NULLs safely

Impact on aggregations

Handling Duplicates and Nulls in SQL

Problem

Context

Question

Scope Guidance

Key Concepts

Identifying duplicate records

Deduplicating while keeping one record

NULL behavior in SQL

Replacing NULLs safely

Impact on aggregations