Context
FinLedger, a B2B payments platform, syncs operational data from PostgreSQL into Snowflake for analytics and into a MySQL read replica used by downstream partner APIs. The current sync process is a mix of hourly batch jobs and change-data-capture (CDC), and the team is seeing intermittent row-count mismatches, duplicate records, and stale updates across targets.
You need to design a troubleshooting and remediation approach for database sync errors, not just a one-time fix. The goal is to identify where divergence occurs, contain bad data quickly, and make the pipeline observable and recoverable.
Scale Requirements
- Source DB: PostgreSQL 14, ~2 TB, 1.2B rows across 150 tables
- Change volume: 25M row changes/day, peak 1,500 updates/sec
- Targets: Snowflake analytics warehouse and MySQL 8 partner-serving database
- Latency target: CDC path < 2 minutes; batch reconciliation < 30 minutes
- Retention: 30 days of raw CDC logs, 1 year of reconciliation results
Requirements
- Design an end-to-end method to troubleshoot sync errors between source and target databases.
- Identify likely failure points: missed CDC events, out-of-order updates, schema drift, transaction replay issues, duplicate loads, and partial batch failures.
- Define how to validate correctness using row counts, checksums, high-watermarks, primary-key diffing, and idempotent reprocessing.
- Propose a remediation workflow for replaying failed ranges without creating duplicates or overwriting newer data.
- Include monitoring and alerting for freshness, data quality, and pipeline health.
- Show how orchestration coordinates CDC ingestion, reconciliation jobs, and backfills.
Constraints
- AWS-only infrastructure; no managed third-party observability platform
- Team of 3 data engineers and 1 platform engineer
- PCI-related payment data: PII must be masked outside the source system
- Incremental monthly budget cap: $18K
- Source PostgreSQL cannot tolerate heavy full-table scans during business hours