
Data accuracy is a core expectation in analytics and data engineering work. Interviewers ask this to understand whether you can produce trustworthy results, not just write queries that run.
Explain how you ensure data accuracy in your SQL workflow. Your answer should cover how you validate source data, check query logic, handle duplicates and nulls, verify aggregations, and confirm that final outputs match business expectations.
Keep the discussion practical. Focus on the steps you would take while writing and reviewing SQL: profiling raw data, applying filters carefully, validating joins, reconciling row counts and totals, and using simple checks to catch mistakes before sharing results. You can mention examples such as comparing pre- and post-transformation counts, checking for unexpected nulls, or validating totals against a trusted source.
Before writing business logic, inspect the raw table to understand row counts, null rates, duplicate keys, and value distributions. This helps catch bad assumptions early and prevents downstream errors.
SELECT COUNT(*) AS row_count,
COUNT(customer_id) AS non_null_customer_id,
COUNT(DISTINCT customer_id) AS distinct_customers
FROM orders;
Many accuracy issues come from joins that accidentally duplicate or drop rows. You should verify key uniqueness and compare row counts before and after joins to confirm the join behaves as expected.
SELECT customer_id, COUNT(*) AS cnt
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;
After transformation, compare totals, counts, or grouped summaries against a trusted baseline. Reconciliation is a simple way to confirm that filters, aggregations, and calculations did not distort the data.
SELECT order_status, COUNT(*) AS orders, SUM(order_amount) AS total_amount
FROM orders
GROUP BY order_status
ORDER BY order_status;
Null values and duplicate records can silently change results, especially in aggregations. A reliable workflow includes explicit checks for missing values and repeated business keys before final reporting.
SELECT order_id, COUNT(*) AS cnt
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
Technical correctness is not enough; results must also make sense in context. Validate outputs against known business rules such as non-negative revenue, valid date ranges, or expected category totals.
SELECT *
FROM orders
WHERE order_amount < 0
OR order_date > CURRENT_DATE;