
Teams often validate ETL reliability by comparing daily record counts in source systems against warehouse tables. This is a common first-line data quality check because it quickly surfaces missing loads, duplicate loads, and date-partition issues.
Explain how you would design a process to compare daily source counts against warehouse counts. Your answer should cover:
The interviewer is looking for a practical design explanation, not a full production implementation. Focus on a simple, reliable reconciliation pattern using SQL aggregations, clear assumptions, and operational considerations such as scheduling, idempotency, and alerting.
The first design decision is the level at which counts are compared, usually by business date, load date, table, and sometimes source system or partition. If the grain is not defined clearly, count comparisons can be misleading because source and warehouse records may not align on the same date field.
SELECT business_date, COUNT(*) AS row_count
FROM source_orders
GROUP BY business_date;
Source and warehouse counts must be based on the same date definition, such as event_date, created_at::date, or ingestion_date. A common failure is comparing source event dates to warehouse load dates, which creates false mismatches even when the pipeline is working correctly.
SELECT created_at::date AS business_date, COUNT(*) AS row_count
FROM warehouse_orders
GROUP BY created_at::date;
A reconciliation process should compare counts side by side and calculate both absolute and percentage differences. This makes it easier to separate small expected variances from major failures such as missing partitions or duplicate loads.
SELECT
business_date,
source_count,
warehouse_count,
warehouse_count - source_count AS count_diff
FROM daily_recon;
Not every mismatch should trigger the same response. Some pipelines allow late-arriving data or small expected differences, so the process should support thresholds, grace periods, and rerun logic before marking a day as failed.
SELECT
business_date,
CASE
WHEN ABS(warehouse_count - source_count) = 0 THEN 'match'
WHEN ABS(warehouse_count - source_count) <= 5 THEN 'within_threshold'
ELSE 'mismatch'
END AS status
FROM daily_recon;
A good design stores reconciliation results in an audit table so teams can track trends, reruns, and issue resolution over time. This also makes alerting and dashboarding much easier than recalculating everything ad hoc.
INSERT INTO recon_audit (table_name, business_date, source_count, warehouse_count, status)
VALUES ('orders', CURRENT_DATE - INTERVAL '1 day', 1200, 1198, 'mismatch');