Diagnose Bad Data in Pipelines

Scenario

You're supporting a data pipeline that feeds customer-facing workflows, and a customer reports behavior that looks wrong in the product. Before treating it as an application bug, you want to determine whether the issue comes from incorrect, missing, duplicated, or delayed data moving through the pipeline.

Question

What would you do if you suspected a customer issue was caused by bad data rather than a product bug?

What to Inspect

Raw source payloads for the affected customer
Ingestion timestamps versus event timestamps
Duplicate keys such as external_record_id
Schema drift, null spikes, and rejected rows
Differences between raw, transformed, and served tables
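
The checks listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular pipeline: the field names event_ts, ingested_ts, and amount, the sample rows, the 15-minute lag threshold, and all four helper functions are hypothetical. Only external_record_id comes from the list above.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical raw rows for one affected customer (all values are made up).
rows = [
    {"external_record_id": "a1", "event_ts": "2024-05-01T10:00:00",
     "ingested_ts": "2024-05-01T10:00:05", "amount": 10.0},
    {"external_record_id": "a2", "event_ts": "2024-05-01T10:01:00",
     "ingested_ts": "2024-05-01T12:30:00", "amount": None},
    {"external_record_id": "a2", "event_ts": "2024-05-01T10:01:00",
     "ingested_ts": "2024-05-01T12:30:01", "amount": None},
    {"external_record_id": "a3", "event_ts": "2024-05-01T10:02:00",
     "ingested_ts": "2024-05-01T10:02:10", "amount": 7.5},
]
# Hypothetical keys that reached the served table.
served = [{"external_record_id": "a1"}, {"external_record_id": "a2"}]


def find_duplicate_keys(rows, key="external_record_id"):
    """Return {key_value: count} for every key that appears more than once."""
    counts = Counter(r[key] for r in rows)
    return {k: n for k, n in counts.items() if n > 1}


def find_late_rows(rows, max_lag=timedelta(minutes=15)):
    """Return (key, lag) pairs where ingestion trailed the event by more than max_lag."""
    late = []
    for r in rows:
        lag = (datetime.fromisoformat(r["ingested_ts"])
               - datetime.fromisoformat(r["event_ts"]))
        if lag > max_lag:
            late.append((r["external_record_id"], lag))
    return late


def null_rate(rows, field):
    """Fraction of rows where the field is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def missing_downstream(upstream, downstream, key="external_record_id"):
    """Keys present upstream (e.g. raw) but absent downstream (e.g. served)."""
    downstream_keys = {r[key] for r in downstream}
    return sorted({r[key] for r in upstream} - downstream_keys)


print("duplicates:", find_duplicate_keys(rows))
print("late rows:", find_late_rows(rows))
print("null rate (amount):", null_rate(rows, "amount"))
print("missing from served:", missing_downstream(rows, served))
```

Running the same helpers against the raw, transformed, and served layers in turn shows where a record first goes wrong, which helps separate a source or pipeline problem from an application bug.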
