Diagnose Cross-System Data Discrepancies

Hard

SQL & Data Manipulation

Asked at 6 companies

Joins, Data Wrangling

Problem

Context

Cross-system metric mismatches are common in analytics environments, especially when reporting (for example, AbbVie customer insight dashboards) depends on multiple upstream data sources. Interviewers want to see whether you can move from “the numbers do not match” to a structured SQL-based diagnosis.

Question

Describe how you would identify the root cause of a data discrepancy between two systems that should report the same customer metric. Explain how you would use SQL to compare row counts, key coverage, duplicates, null handling, transformation logic, and timing differences. You should also discuss how you would narrow the issue from aggregate mismatch to specific records and how you would validate whether the discrepancy comes from joins, filters, late-arriving data, or business-rule differences.

Scope guidance

Answer at the level of a senior analyst: focus on a practical, repeatable debugging workflow, the SQL patterns you would use, and how you would communicate findings once you isolate the issue.

Key Concepts

Reconcile aggregates before drilling into records

Start by confirming the discrepancy at the highest useful level, such as by date, channel, or customer segment. This tells you whether the issue is global or isolated and helps you reduce the search space before comparing individual rows.

SELECT report_date, COUNT(*) AS row_count, SUM(metric_value) AS total_value
FROM system_a_metrics
GROUP BY report_date;
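The query above summarizes only one system. As a minimal sketch of the side-by-side reconciliation, the example below stacks both systems and returns only the dates whose row counts or totals disagree. It runs the SQL through Python's built-in sqlite3 module so it is self-contained; the sample rows, and the assumption that system_b_metrics has the same columns as system_a_metrics, are illustrative and not from the original question.

```python
import sqlite3

# In-memory database with illustrative data (assumed schema, not from the source).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE system_a_metrics (report_date TEXT, customer_id INTEGER, metric_value REAL);
CREATE TABLE system_b_metrics (report_date TEXT, customer_id INTEGER, metric_value REAL);
INSERT INTO system_a_metrics VALUES
  ('2024-01-01', 1, 10.0), ('2024-01-01', 2, 20.0), ('2024-01-02', 1, 5.0);
INSERT INTO system_b_metrics VALUES
  ('2024-01-01', 1, 10.0), ('2024-01-01', 2, 20.0),
  ('2024-01-02', 1, 5.0),  ('2024-01-02', 3, 7.0);
""")

DAILY_DIFF_SQL = """
WITH combined AS (
  -- Label each row with its source so one GROUP BY can summarize both systems.
  SELECT 'a' AS src, report_date, metric_value FROM system_a_metrics
  UNION ALL
  SELECT 'b' AS src, report_date, metric_value FROM system_b_metrics
),
daily AS (
  SELECT report_date,
         SUM(CASE WHEN src = 'a' THEN 1 ELSE 0 END) AS a_rows,
         SUM(CASE WHEN src = 'b' THEN 1 ELSE 0 END) AS b_rows,
         SUM(CASE WHEN src = 'a' THEN metric_value ELSE 0.0 END) AS a_total,
         SUM(CASE WHEN src = 'b' THEN metric_value ELSE 0.0 END) AS b_total
  FROM combined
  GROUP BY report_date
)
-- Keep only the dates where the two systems disagree.
SELECT report_date, a_rows, b_rows, a_total, b_total
FROM daily
WHERE a_rows <> b_rows OR a_total <> b_total
ORDER BY report_date;
"""

mismatched_days = conn.execute(DAILY_DIFF_SQL).fetchall()
print(mismatched_days)
```

With this sample data only 2024-01-02 is returned, which narrows the record-level search to a single day. With real floating-point totals, an exact `<>` comparison may be too strict and a small tolerance could be a better fit.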

Use joins to identify missing and extra records

A FULL OUTER JOIN on the business key is often the fastest way to find records present in one system but not the other. This distinguishes coverage issues from value mismatches and shows quickly whether the discrepancy comes from dropped or extra records.

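-- Rows missing from one side, plus matched keys whose values differ.
SELECT COALESCE(a.customer_id, b.customer_id) AS customer_id,
       a.metric_value AS a_value,
       b.metric_value AS b_value
FROM system_a_metrics a
FULL OUTER JOIN system_b_metrics b
  ON a.customer_id = b.customer_id
WHERE a.customer_id IS NULL                                    -- key exists only in system B
   OR b.customer_id IS NULL                                    -- key exists only in system A
   OR a.metric_value <> b.metric_value                         -- both values present but different
   OR (a.metric_value IS NULL AND b.metric_value IS NOT NULL)  -- <> alone never matches a NULL
   OR (a.metric_value IS NOT NULL AND b.metric_value IS NULL);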
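Once the record-level differences are available, it can help to count them by type before reading individual rows. The sketch below is one way to do that under stated assumptions: it emulates the full outer join with a LEFT JOIN plus an anti-join (useful on engines without FULL OUTER JOIN, such as older SQLite builds), assumes customer_id is unique in each table, and uses made-up sample rows and category labels.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE system_a_metrics (customer_id INTEGER, metric_value REAL);
CREATE TABLE system_b_metrics (customer_id INTEGER, metric_value REAL);
INSERT INTO system_a_metrics VALUES (1, 10.0), (2, 20.0), (3, NULL), (4, 40.0);
INSERT INTO system_b_metrics VALUES (1, 10.0), (2, 25.0), (3, 30.0), (5, 50.0);
""")

CLASSIFY_SQL = """
WITH paired AS (
  -- Every key in A, with its B counterpart if one exists.
  SELECT a.customer_id AS customer_id,
         a.metric_value AS a_value,
         b.metric_value AS b_value,
         1 AS in_a,
         CASE WHEN b.customer_id IS NULL THEN 0 ELSE 1 END AS in_b
  FROM system_a_metrics a
  LEFT JOIN system_b_metrics b ON a.customer_id = b.customer_id
  UNION ALL
  -- Keys that exist only in B (anti-join).
  SELECT b.customer_id, NULL, b.metric_value, 0, 1
  FROM system_b_metrics b
  LEFT JOIN system_a_metrics a ON a.customer_id = b.customer_id
  WHERE a.customer_id IS NULL
)
SELECT CASE
         WHEN in_b = 0 THEN 'missing_in_b'
         WHEN in_a = 0 THEN 'missing_in_a'
         WHEN a_value IS NULL AND b_value IS NULL THEN 'match'
         WHEN a_value IS NULL OR b_value IS NULL THEN 'null_mismatch'
         WHEN a_value <> b_value THEN 'value_mismatch'
         ELSE 'match'
       END AS issue,
       COUNT(*) AS n
FROM paired
GROUP BY issue
ORDER BY issue;
"""

issue_counts = conn.execute(CLASSIFY_SQL).fetchall()
print(issue_counts)
```

A breakdown dominated by missing keys points toward coverage or load problems, while one dominated by value or null mismatches points toward transformation logic.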

Check duplication and grain mismatches

Two systems can disagree even when they contain the same keys if one table is stored at a finer grain or a join multiplies rows. Counting records per business key and comparing expected uniqueness is critical before trusting any aggregate comparison.

SELECT customer_id, COUNT(*) AS records_per_key
FROM system_b_metrics
GROUP BY customer_id
HAVING COUNT(*) > 1;
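The duplicate check above finds repeated keys in a single table. A related failure is join fan-out, where a metric is summed after joining to a table stored at a finer grain. The sketch below illustrates the effect with two hypothetical tables, customers and customer_addresses (names and rows are assumptions for illustration), and compares a naive joined total with one that only tests for existence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: one row per customer vs. one row per customer address.
CREATE TABLE customers (customer_id INTEGER, metric_value REAL);
CREATE TABLE customer_addresses (customer_id INTEGER, address_type TEXT);
INSERT INTO customers VALUES (1, 100.0), (2, 50.0);
INSERT INTO customer_addresses VALUES (1, 'billing'), (1, 'shipping'), (2, 'billing');
""")

# Naive join: customer 1 has two addresses, so its metric is counted twice.
naive_total = conn.execute("""
SELECT SUM(c.metric_value)
FROM customers c
JOIN customer_addresses d ON d.customer_id = c.customer_id;
""").fetchone()[0]

# EXISTS keeps the customer grain: each customer contributes at most once.
safe_total = conn.execute("""
SELECT SUM(c.metric_value)
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM customer_addresses d WHERE d.customer_id = c.customer_id
);
""").fetchone()[0]

# Fan-out factor: joined rows per distinct key; anything above 1 signals multiplication.
joined_rows, distinct_keys = conn.execute("""
SELECT COUNT(*), COUNT(DISTINCT c.customer_id)
FROM customers c
JOIN customer_addresses d ON d.customer_id = c.customer_id;
""").fetchone()

print(naive_total, safe_total, joined_rows, distinct_keys)
```

Comparing the joined row count with the distinct key count is a cheap test to run before trusting any aggregate built on a join.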

Validate transformation and filtering logic

Many discrepancies come from inconsistent CASE logic, date filters, status filters, or null handling. Rebuilding the metric in SQL from raw inputs for both systems helps isolate whether the issue is in source data or in business-rule implementation.

SELECT customer_id,
       CASE WHEN status = 'active' AND amount > 0 THEN amount ELSE 0 END AS normalized_value
FROM raw_events;
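To make the rule comparison concrete, the sketch below applies two versions of the metric rule to the same raw rows and returns only the rows where they disagree. The second rule (case-insensitive status, NULL status treated as active) is a hypothetical stand-in for another system's logic, and the event_id column and sample rows are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (event_id INTEGER, customer_id INTEGER, status TEXT, amount REAL);
INSERT INTO raw_events VALUES
  (1, 1, 'active',   10.0),
  (2, 1, 'Active',    5.0),   -- differs only by letter case
  (3, 2, 'inactive',  7.0),
  (4, 2, NULL,        3.0);   -- missing status
""")

RULE_DIFF_SQL = """
WITH rebuilt AS (
  SELECT event_id,
         -- Rule A: the CASE logic shown in the text.
         CASE WHEN status = 'active' AND amount > 0
              THEN amount ELSE 0.0 END AS rule_a,
         -- Rule B (hypothetical): case-insensitive, NULL status counted as active.
         CASE WHEN LOWER(COALESCE(status, 'active')) = 'active' AND amount > 0
              THEN amount ELSE 0.0 END AS rule_b
  FROM raw_events
)
SELECT event_id, rule_a, rule_b
FROM rebuilt
WHERE rule_a <> rule_b
ORDER BY event_id;
"""

rule_differences = conn.execute(RULE_DIFF_SQL).fetchall()
print(rule_differences)
```

Running both rules over identical input separates source-data problems from rule differences: if the inputs match and the outputs do not, the business logic is the cause.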

Use CTEs and window functions to trace timing issues

Late-arriving data, multiple updates per key, and snapshot timing often create mismatches between operational and reporting systems. CTEs and window functions such as ROW_NUMBER() help you compare the latest version of each record at a consistent cutoff time.

WITH ranked AS (
  SELECT customer_id, updated_at, metric_value,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM system_a_history
)
SELECT customer_id, metric_value
FROM ranked
WHERE rn = 1;
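The query above returns the latest row overall. To compare systems at a consistent cutoff time, the history can be filtered before ranking. The sketch below shows one possible form with a named :cutoff parameter; the parameter name, the helper function latest_as_of, and the sample rows are assumptions, timestamps are stored as ISO-8601 text so string comparison matches time order, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE system_a_history (customer_id INTEGER, updated_at TEXT, metric_value REAL);
INSERT INTO system_a_history VALUES
  (1, '2024-01-01 08:00:00', 10.0),
  (1, '2024-01-02 09:00:00', 12.0),
  (2, '2024-01-01 10:00:00', 20.0),
  (2, '2024-01-03 01:00:00', 25.0);  -- arrives after the reporting snapshot
""")

LATEST_AS_OF_SQL = """
WITH ranked AS (
  SELECT customer_id, updated_at, metric_value,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM system_a_history
  WHERE updated_at <= :cutoff   -- ignore versions written after the cutoff
)
SELECT customer_id, metric_value
FROM ranked
WHERE rn = 1
ORDER BY customer_id;
"""

def latest_as_of(cutoff):
    """Return the latest (customer_id, metric_value) per customer as of the cutoff."""
    return conn.execute(LATEST_AS_OF_SQL, {"cutoff": cutoff}).fetchall()

before_late_update = latest_as_of("2024-01-02 23:59:59")
after_late_update = latest_as_of("2024-01-03 23:59:59")
print(before_late_update, after_late_update)
```

If the two systems agree at the earlier cutoff and diverge at the later one, late-arriving updates are a likely explanation rather than a logic error.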
