Troubleshoot Cross-Database Sync Failures

Context

FinLedger, a B2B payments platform, syncs operational data from PostgreSQL into Snowflake for analytics and into a MySQL read replica used by downstream partner APIs. The current sync process is a mix of hourly batch jobs and change-data-capture (CDC), and the team is seeing intermittent row-count mismatches, duplicate records, and stale updates across targets.

You need to design a troubleshooting and remediation approach for database sync errors, not just a one-time fix. The goal is to identify where divergence occurs, contain bad data quickly, and make the pipeline observable and recoverable.

Scale Requirements

Source DB: PostgreSQL 14, ~2 TB, 1.2B rows across 150 tables
Change volume: 25M row changes/day, peak 1,500 updates/sec
Targets: Snowflake analytics warehouse and MySQL 8 partner-serving database
Latency target: CDC path < 2 minutes; batch reconciliation < 30 minutes
Retention: 30 days of raw CDC logs, 1 year of reconciliation results

Requirements

Design an end-to-end method to troubleshoot sync errors between source and target databases.
Identify likely failure points: missed CDC events, out-of-order updates, schema drift, transaction replay issues, duplicate loads, and partial batch failures.
Define how to validate correctness using row counts, checksums, high-watermarks, primary-key diffing, and idempotent reprocessing.
Propose a remediation workflow for replaying failed ranges without creating duplicates or overwriting newer data.
Include monitoring and alerting for freshness, data quality, and pipeline health.
Show how orchestration coordinates CDC ingestion, reconciliation jobs, and backfills.

Constraints

AWS-only infrastructure; no managed third-party observability platform
Team of 3 data engineers and 1 platform engineer
PCI-related payment data: PII must be masked outside the source system
Incremental monthly budget cap: $18K
Source PostgreSQL cannot tolerate heavy full-table scans during business hours

Problem

Context

Scale Requirements

Source DB: PostgreSQL 14, ~2 TB, 1.2B rows across 150 tables
Change volume: 25M row changes/day, peak 1,500 updates/sec
Targets: Snowflake analytics warehouse and MySQL 8 partner-serving database
Latency target: CDC path < 2 minutes; batch reconciliation < 30 minutes
Retention: 30 days of raw CDC logs, 1 year of reconciliation results

Requirements

Design an end-to-end method to troubleshoot sync errors between source and target databases.
Identify likely failure points: missed CDC events, out-of-order updates, schema drift, transaction replay issues, duplicate loads, and partial batch failures.
Define how to validate correctness using row counts, checksums, high-watermarks, primary-key diffing, and idempotent reprocessing.
Propose a remediation workflow for replaying failed ranges without creating duplicates or overwriting newer data.
Include monitoring and alerting for freshness, data quality, and pipeline health.
Show how orchestration coordinates CDC ingestion, reconciliation jobs, and backfills.

Constraints

AWS-only infrastructure; no managed third-party observability platform
Team of 3 data engineers and 1 platform engineer
PCI-related payment data: PII must be masked outside the source system
Incremental monthly budget cap: $18K
Source PostgreSQL cannot tolerate heavy full-table scans during business hours

Problem

Context

Scale Requirements

Source DB: PostgreSQL 14, ~2 TB, 1.2B rows across 150 tables
Change volume: 25M row changes/day, peak 1,500 updates/sec
Targets: Snowflake analytics warehouse and MySQL 8 partner-serving database
Latency target: CDC path < 2 minutes; batch reconciliation < 30 minutes
Retention: 30 days of raw CDC logs, 1 year of reconciliation results

Requirements

Design an end-to-end method to troubleshoot sync errors between source and target databases.
Identify likely failure points: missed CDC events, out-of-order updates, schema drift, transaction replay issues, duplicate loads, and partial batch failures.
Define how to validate correctness using row counts, checksums, high-watermarks, primary-key diffing, and idempotent reprocessing.
Propose a remediation workflow for replaying failed ranges without creating duplicates or overwriting newer data.
Include monitoring and alerting for freshness, data quality, and pipeline health.
Show how orchestration coordinates CDC ingestion, reconciliation jobs, and backfills.

Constraints

AWS-only infrastructure; no managed third-party observability platform
Team of 3 data engineers and 1 platform engineer
PCI-related payment data: PII must be masked outside the source system
Incremental monthly budget cap: $18K
Source PostgreSQL cannot tolerate heavy full-table scans during business hours

Problem

Context

Scale Requirements

Source DB: PostgreSQL 14, ~2 TB, 1.2B rows across 150 tables
Change volume: 25M row changes/day, peak 1,500 updates/sec
Targets: Snowflake analytics warehouse and MySQL 8 partner-serving database
Latency target: CDC path < 2 minutes; batch reconciliation < 30 minutes
Retention: 30 days of raw CDC logs, 1 year of reconciliation results

Requirements

Design an end-to-end method to troubleshoot sync errors between source and target databases.
Identify likely failure points: missed CDC events, out-of-order updates, schema drift, transaction replay issues, duplicate loads, and partial batch failures.
Define how to validate correctness using row counts, checksums, high-watermarks, primary-key diffing, and idempotent reprocessing.
Propose a remediation workflow for replaying failed ranges without creating duplicates or overwriting newer data.
Include monitoring and alerting for freshness, data quality, and pipeline health.
Show how orchestration coordinates CDC ingestion, reconciliation jobs, and backfills.

Constraints

AWS-only infrastructure; no managed third-party observability platform
Team of 3 data engineers and 1 platform engineer
PCI-related payment data: PII must be masked outside the source system
Incremental monthly budget cap: $18K
Source PostgreSQL cannot tolerate heavy full-table scans during business hours

Interview Guides

Problem

Context

Scale Requirements

Requirements

Constraints

Problem

Context

Scale Requirements

Requirements

Constraints

Troubleshoot Cross-Database Sync Failures

Problem

Context

Scale Requirements

Requirements

Constraints

Problem

Context

Scale Requirements

Requirements

Constraints