Dataford
Interview Guides
Upgrade
All questions/SQL & Data Manipulation/Handling Duplicates and Nulls in SQL

Handling Duplicates and Nulls in SQL

Easy
SQL & Data Manipulation
Asked at 2 companies2Group ByCase WhenData Wrangling
Also asked at
Publicis SapientApptio

Problem

Context

Data quality issues often show up as duplicate rows and missing values. In analytics and operational systems, both can distort counts, aggregations, and downstream reporting if they are not handled carefully.

Question

Explain how you would handle duplicate records and NULL values in a dataset using SQL. Your answer should cover:

  1. How to identify duplicates
  2. How to remove or retain the correct record when duplicates exist
  3. How NULL values affect filtering, grouping, and aggregations
  4. Common SQL techniques for replacing, excluding, or preserving NULLs depending on the use case

Scope Guidance

The interviewer expects a practical explanation, not just definitions. Discuss the trade-offs between deleting data, deduplicating in queries, and preventing bad data at ingestion. Use simple SQL examples and mention common mistakes, especially around COUNT, GROUP BY, and comparisons with NULL.

Key Concepts

Identifying duplicate records

Duplicates are usually found by grouping on the business key rather than the physical primary key. If a combination of columns appears more than once, it may indicate repeated records that should be reviewed or consolidated.

SELECT customer_email, order_date, COUNT(*) AS duplicate_count
FROM orders
GROUP BY customer_email, order_date
HAVING COUNT(*) > 1;

Deduplicating while keeping one record

When duplicates exist, you often need a rule to decide which row to keep, such as the latest timestamp or the lowest ID. The key idea is that deduplication is not only about removing extras, but also about preserving the correct version of the data.

SELECT DISTINCT customer_email, order_date
FROM orders;

NULL behavior in SQL

NULL means missing or unknown, not an empty string or zero. Comparisons like = NULL do not work in SQL; you must use IS NULL or IS NOT NULL to test for missing values.

SELECT *
FROM customers
WHERE phone_number IS NULL;

Replacing NULLs safely

Sometimes missing values should be replaced for reporting or display purposes. In PostgreSQL, COALESCE is commonly used to substitute a fallback value without changing the underlying stored data.

SELECT customer_name, COALESCE(city, 'Unknown') AS city
FROM customers;

Impact on aggregations

Aggregate functions treat NULLs differently: COUNT(column) ignores NULLs, while COUNT(*) counts all rows. This distinction is important when measuring completeness or summarizing data accurately.

SELECT COUNT(*) AS total_rows,
       COUNT(discount_code) AS rows_with_discount
FROM orders;

Problem

Context

Data quality issues often show up as duplicate rows and missing values. In analytics and operational systems, both can distort counts, aggregations, and downstream reporting if they are not handled carefully.

Question

Explain how you would handle duplicate records and NULL values in a dataset using SQL. Your answer should cover:

  1. How to identify duplicates
  2. How to remove or retain the correct record when duplicates exist
  3. How NULL values affect filtering, grouping, and aggregations
  4. Common SQL techniques for replacing, excluding, or preserving NULLs depending on the use case

Scope Guidance

The interviewer expects a practical explanation, not just definitions. Discuss the trade-offs between deleting data, deduplicating in queries, and preventing bad data at ingestion. Use simple SQL examples and mention common mistakes, especially around COUNT, GROUP BY, and comparisons with NULL.

Key Concepts

Identifying duplicate records

Duplicates are usually found by grouping on the business key rather than the physical primary key. If a combination of columns appears more than once, it may indicate repeated records that should be reviewed or consolidated.

SELECT customer_email, order_date, COUNT(*) AS duplicate_count
FROM orders
GROUP BY customer_email, order_date
HAVING COUNT(*) > 1;

Deduplicating while keeping one record

When duplicates exist, you often need a rule to decide which row to keep, such as the latest timestamp or the lowest ID. The key idea is that deduplication is not only about removing extras, but also about preserving the correct version of the data.

SELECT DISTINCT customer_email, order_date
FROM orders;

NULL behavior in SQL

NULL means missing or unknown, not an empty string or zero. Comparisons like = NULL do not work in SQL; you must use IS NULL or IS NOT NULL to test for missing values.

SELECT *
FROM customers
WHERE phone_number IS NULL;

Replacing NULLs safely

Sometimes missing values should be replaced for reporting or display purposes. In PostgreSQL, COALESCE is commonly used to substitute a fallback value without changing the underlying stored data.

SELECT customer_name, COALESCE(city, 'Unknown') AS city
FROM customers;

Impact on aggregations

Aggregate functions treat NULLs differently: COUNT(column) ignores NULLs, while COUNT(*) counts all rows. This distinction is important when measuring completeness or summarizing data accurately.

SELECT COUNT(*) AS total_rows,
       COUNT(discount_code) AS rows_with_discount
FROM orders;
Your answer
Try one AI text evaluation on us
Get structured feedback, scored against a 4-axis rubric. Premium unlocks unlimited.
0 wordstarget ~200
Up next
ZendeskHandling Missing Values in SQLEasyAHandling Nulls in Analytics PrepEasyQuoraHandling Missing Data in SQLEasy
Next question