





Teams often use SQL as a first pass to validate raw data before reporting, modeling, or downstream processing. Interviewers ask this to see whether you can identify common data quality issues with simple, reliable queries.
Explain how you would perform basic database validation using SQL. Your answer should cover:
NULL values in important columnsKeep the discussion practical and focused on easy SQL techniques. The interviewer is not looking for a full data quality framework; they want to hear how you would use straightforward SELECT, WHERE, GROUP BY, and aggregate queries to catch common issues quickly.
A basic validation starts by checking whether the number of rows matches expectations. This helps detect missing loads, duplicate loads, or unexpected spikes before deeper analysis.
SELECT COUNT(*) AS row_count
FROM orders;
Important columns such as IDs, dates, or statuses should often not be NULL. Counting NULLs by column quickly reveals incomplete records and ingestion issues.
SELECT COUNT(*) AS missing_customer_id
FROM orders
WHERE customer_id IS NULL;
Duplicates can be checked by grouping on a business key and looking for counts greater than 1. This is useful when a table should contain one row per entity, event, or transaction.
SELECT order_id, COUNT(*) AS duplicate_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
SQL can verify whether values fall within expected ranges or approved categories. Examples include negative amounts, invalid statuses, or dates outside an expected period.
SELECT *
FROM orders
WHERE amount < 0
OR status NOT IN ('pending', 'shipped', 'cancelled');
Good validation does not stop at finding bad rows; it also summarizes the issue so others can act on it. Aggregations help quantify how many records fail each rule.
SELECT status, COUNT(*) AS row_count
FROM orders
GROUP BY status
ORDER BY row_count DESC;