



Before sharing data with a customer, analysts need to confirm that the dataset is accurate, complete, and internally consistent. Interviewers ask this to assess whether you can combine SQL checks with sound data validation habits.
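Explain how you would verify that a dataset is clean and trustworthy before presenting it to a customer. Your answer should cover completeness, duplicate detection, validity of values, reconciliation of totals, and how you would handle any problems you find.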
Keep the answer practical and SQL-focused. The interviewer is not looking for a complex pipeline design; they want a clear framework for validating a dataset, examples of simple PostgreSQL checks, and a structured explanation of how you would build confidence before presenting results.
A trustworthy dataset should not have unexpected missing values in key fields such as IDs, dates, or metrics required for reporting. In SQL, completeness checks usually start with counting NULLs and comparing row counts to expected baselines.
-- Count rows with no customer identifier; a clean table should return 0.
SELECT COUNT(*) AS missing_customer_id_rows
FROM customer_report
WHERE customer_id IS NULL;
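The same check can be extended to several key columns at once and to a row-count baseline. The query below is a minimal sketch that assumes `customer_id`, `report_date`, and `revenue` are the fields required for reporting; the value 1000 is a made-up placeholder for whatever expected row count you actually have, such as the count in the source system.

```sql
-- NULL counts for each key column in a single pass.
-- The FILTER clause is available in PostgreSQL 9.4 and later.
SELECT
    COUNT(*)                                     AS total_rows,
    COUNT(*) FILTER (WHERE customer_id IS NULL)  AS missing_customer_id,
    COUNT(*) FILTER (WHERE report_date IS NULL)  AS missing_report_date,
    COUNT(*) FILTER (WHERE revenue IS NULL)      AS missing_revenue,
    -- 1000 is a hypothetical expected row count; replace it with your baseline.
    COUNT(*) - 1000                              AS diff_from_expected_rows
FROM customer_report;
```

A non-zero value in any `missing_*` column, or a large `diff_from_expected_rows`, is a prompt to investigate before going further.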
Duplicate rows can inflate counts, sums, and customer-facing metrics. A common validation step is grouping by the expected business key and checking for counts greater than one.
-- List business keys that appear more than once; a clean table returns no rows.
SELECT customer_id, report_date, COUNT(*) AS row_count
FROM customer_report
GROUP BY customer_id, report_date
HAVING COUNT(*) > 1;
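Once duplicated keys are known, it usually helps to see the full rows so you can tell exact copies from conflicting records. One possible approach, assuming as above that `(customer_id, report_date)` is the business key, is a window function:

```sql
-- Return every row that shares its business key with at least one other row.
SELECT *
FROM (
    SELECT
        cr.*,
        COUNT(*) OVER (PARTITION BY customer_id, report_date) AS rows_per_key
    FROM customer_report AS cr
) AS keyed
WHERE rows_per_key > 1
ORDER BY customer_id, report_date;
```

If the duplicated rows are identical in every column, deduplicating is usually safe; if they disagree on a metric such as `revenue`, the conflict needs to be resolved at the source.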
Data can be present but still wrong, such as negative revenue, impossible dates, or invalid status values. SQL is useful for flagging values outside allowed ranges or outside a known set of categories.
-- Flag rows with negative revenue or a report date in the future.
SELECT *
FROM customer_report
WHERE revenue < 0
   OR report_date > CURRENT_DATE;
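Category checks follow the same pattern. The sketch below assumes a hypothetical text column named `status` and an illustrative set of allowed values; neither comes from the table definition, so substitute the real column and value list.

```sql
-- Assumed: a text column named status with three allowed values.
-- NOT IN alone skips NULLs, so the NULL case is tested explicitly.
SELECT customer_id, report_date, status
FROM customer_report
WHERE status IS NULL
   OR status NOT IN ('active', 'paused', 'churned');
```

Testing for NULL separately matters because `status NOT IN (...)` evaluates to NULL, not true, when `status` is NULL, so those rows would otherwise be silently missed.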
Even if row-level data looks reasonable, totals may still be wrong. Analysts often compare aggregated outputs to source totals, prior reports, or known benchmarks to confirm the final numbers are believable.
-- Daily revenue totals, to compare against source totals or prior reports.
SELECT report_date, SUM(revenue) AS total_revenue
FROM customer_report
GROUP BY report_date
ORDER BY report_date;
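When a source-of-truth table is available, the comparison itself can be done in SQL. The following is a sketch only: `source_orders`, `order_date`, and `amount` are hypothetical names standing in for whatever upstream table the report was built from, and `order_date` is assumed to be a `date` column.

```sql
-- source_orders(order_date, amount) is a hypothetical upstream table.
WITH report_totals AS (
    SELECT report_date, SUM(revenue) AS report_revenue
    FROM customer_report
    GROUP BY report_date
),
source_totals AS (
    SELECT order_date AS report_date, SUM(amount) AS source_revenue
    FROM source_orders
    GROUP BY order_date
)
-- Keep only dates where the two totals disagree or one side is missing.
SELECT
    report_date,
    r.report_revenue,
    s.source_revenue,
    COALESCE(r.report_revenue, 0) - COALESCE(s.source_revenue, 0) AS difference
FROM report_totals AS r
FULL OUTER JOIN source_totals AS s USING (report_date)
WHERE r.report_revenue IS DISTINCT FROM s.source_revenue
ORDER BY report_date;
```

The `FULL OUTER JOIN` surfaces dates that exist on only one side, and `IS DISTINCT FROM` treats a NULL total as a mismatch instead of dropping the row.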
Finding a problem is only part of the job; you also need to decide whether to fix, exclude, annotate, or escalate it. In interviews, strong answers explain both the SQL checks and the decision-making process after anomalies are found.
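One way to support that decision is to label problem rows instead of deleting them, so the choice to fix, exclude, or escalate stays visible and reversible. The query below is an illustrative sketch; the flag names are invented for this example.

```sql
-- Annotate each row with the first quality issue it hits, or 'ok'.
-- CASE stops at the first matching branch, so a row with several
-- problems is reported under the earliest one listed.
SELECT
    cr.*,
    CASE
        WHEN customer_id IS NULL        THEN 'missing_customer_id'
        WHEN revenue < 0                THEN 'negative_revenue'
        WHEN report_date > CURRENT_DATE THEN 'future_report_date'
        ELSE 'ok'
    END AS quality_flag
FROM customer_report AS cr;
```

Filtering on `quality_flag = 'ok'` gives a customer-ready subset, while the remaining rows form a concrete list to review with the data owner.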