
Teams often combine data from application databases, vendor feeds, and event logs into a single reporting layer. If the integration logic is weak, duplicates, missing matches, inconsistent formats, and incorrect aggregates can quickly make downstream analysis unreliable.
What techniques would you use to ensure data integrity when combining multiple data sources in SQL? In your answer, explain how you would validate join keys, handle duplicates, standardize data types and formats, manage NULLs, and verify that row counts and aggregates remain correct after combining data.
Keep the discussion practical and SQL-focused. The interviewer is not looking for a full data platform design; they want to hear the core checks, query patterns, and validation habits you would use before and after merging datasets in PostgreSQL.
Before combining tables, verify that the join columns are complete, standardized, and unique at the expected grain. If keys are duplicated or formatted inconsistently, even a correct JOIN can produce incorrect row multiplication or missed matches.
SELECT customer_id, COUNT(*) AS row_count
FROM source_a
GROUP BY customer_id
HAVING COUNT(*) > 1;
When multiple sources contain repeated records, define the business grain first, such as one row per order or one row per customer per day. Then remove duplicates explicitly instead of assuming source data is already clean.
SELECT DISTINCT order_id, customer_id, order_date
FROM source_orders;
Data from different systems may store the same field differently, such as text versus integer IDs or inconsistent date formats. Converting values to a common type and format before combining them prevents silent mismatches and comparison errors.
SELECT
CAST(customer_id AS INT) AS customer_id,
TO_DATE(order_date_text, 'YYYY-MM-DD') AS order_date
FROM raw_feed;
Missing values can affect joins, filters, and aggregates. You should decide whether NULL means unknown, not applicable, or missing data, and use LEFT JOIN, COALESCE, and explicit null checks based on the reporting requirement.
SELECT a.customer_id, COALESCE(b.region, 'Unknown') AS region
FROM customers a
LEFT JOIN customer_regions b
ON a.customer_id = b.customer_id;
After combining sources, validate the result with row counts, distinct counts, and aggregate comparisons against the original tables. Reconciliation helps catch duplicate joins, dropped records, and incorrect filters early.
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT order_id) AS distinct_orders
FROM merged_orders;