Context
LedgerFlow, a B2B payments platform, runs its operational workloads on Amazon RDS for PostgreSQL and serves internal analytics from nightly batch exports. The company is migrating to Snowflake for analytics and selected domain services, but the current batch-only export process causes stale data, inconsistent schemas, and risky cutovers.
You need to design a migration pipeline that performs an initial historical load plus ongoing change data capture (CDC), keeps source and target synchronized during the migration window, and supports a controlled cutover with rollback.
Scale Requirements
- Source database: PostgreSQL 13 on RDS, ~12 TB total data, 4.5B rows across 180 tables
- Change volume: 25K row changes/sec at peak, 6K/sec on average
- Largest table: payments with 1.2B rows, ~4 TB
- Latency target: CDC data visible in Snowflake within 3 minutes
- Migration window: complete the initial load within 7 days without increasing OLTP p95 latency by more than 10%
- Retention: raw CDC logs for 30 days, curated warehouse tables indefinitely
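
For sizing, the 7-day window implies roughly the following sustained extract rate; a back-of-the-envelope calculation from the figures above (decimal units assumed):

```python
# Sustained extract rate implied by loading ~12 TB within the 7-day window.
total_bytes = 12e12                  # ~12 TB of source data (decimal units assumed)
window_seconds = 7 * 24 * 3600       # 7-day migration window

rate_mb_per_s = total_bytes / window_seconds / 1e6
print(f"Average sustained extract rate: ~{rate_mb_per_s:.0f} MB/s")   # ~20 MB/s
```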
Requirements
- Design an initial bulk-load strategy for large tables and a CDC pipeline for ongoing updates (see the chunked-extract sketch after this list).
- Ensure idempotent processing so retries do not create duplicates or inconsistent state (see the MERGE sketch below).
- Handle schema evolution, late-arriving changes, deletes, and table backfills.
- Provide table-level and row-level reconciliation between PostgreSQL and Snowflake before cutover (see the reconciliation sketch below).
- Orchestrate dependencies across extraction, load, validation, and promotion steps (see the DAG sketch below).
- Define a rollback plan if data quality checks fail after cutover.
- Support selective reprocessing for a single table or date range (see the reprocessing sketch below).
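
For the initial load, one option is a key-range (keyset-paginated) extract to S3 so that no single query scans the full table or holds a long-running lock. A minimal sketch, assuming psycopg2 and boto3 are available; the bucket, host, chunk size, and column names are hypothetical, and the code assumes the key is the first selected column:

```python
import csv
import io

import boto3
import psycopg2

BUCKET = "ledgerflow-migration-raw"   # hypothetical S3 landing bucket
CHUNK_ROWS = 500_000                  # rows per exported chunk

def export_table(table: str = "payments", key_col: str = "payment_id") -> None:
    """Page through the table by primary key and land each chunk in S3 as CSV."""
    s3 = boto3.client("s3")
    conn = psycopg2.connect(
        host="ledgerflow-rds.example.internal",   # hypothetical; credentials omitted
        dbname="ledgerflow",
        user="migration_svc",
    )
    last_key, chunk_no = 0, 0
    while True:
        with conn.cursor() as cur:
            # Keyset pagination: each query is an index range scan, not a full scan.
            cur.execute(
                f"SELECT * FROM {table} WHERE {key_col} > %s ORDER BY {key_col} LIMIT %s",
                (last_key, CHUNK_ROWS),
            )
            rows = cur.fetchall()
        if not rows:
            break
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        s3.put_object(
            Bucket=BUCKET,
            Key=f"initial_load/{table}/chunk_{chunk_no:06d}.csv",
            Body=buf.getvalue(),
        )
        last_key = rows[-1][0]        # assumes key_col is the first column in SELECT *
        chunk_no += 1
    conn.close()
```

Ongoing CDC would typically come from PostgreSQL logical replication (a replication slot consumed by a tool such as AWS DMS or Debezium), landing change events in S3 and loading them into a Snowflake staging table via Snowpipe or COPY INTO.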
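Idempotency can come from keying applies on the source primary key plus a monotonically increasing change position (for example, the log sequence number), so reapplying the same batch is a no-op. A sketch of a Snowflake MERGE driven from Python, assuming a staging table STAGING.PAYMENTS_CDC with hypothetical PAYMENT_ID, SOURCE_LSN, OP, AMOUNT, and STATUS columns:

```python
import snowflake.connector

# Deduplicate to the latest change per key, then apply inserts/updates/deletes.
# Comparing SOURCE_LSN makes re-running the same batch a no-op (idempotent).
MERGE_PAYMENTS = """
MERGE INTO ANALYTICS.PAYMENTS AS tgt
USING (
    SELECT *
    FROM STAGING.PAYMENTS_CDC
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY PAYMENT_ID ORDER BY SOURCE_LSN DESC
    ) = 1
) AS src
ON tgt.PAYMENT_ID = src.PAYMENT_ID
WHEN MATCHED AND src.OP = 'D' THEN DELETE
WHEN MATCHED AND src.SOURCE_LSN > tgt.SOURCE_LSN THEN UPDATE SET
    tgt.AMOUNT = src.AMOUNT,
    tgt.STATUS = src.STATUS,
    tgt.SOURCE_LSN = src.SOURCE_LSN
WHEN NOT MATCHED AND src.OP <> 'D' THEN INSERT
    (PAYMENT_ID, AMOUNT, STATUS, SOURCE_LSN)
    VALUES (src.PAYMENT_ID, src.AMOUNT, src.STATUS, src.SOURCE_LSN)
"""

def apply_cdc_batch() -> None:
    conn = snowflake.connector.connect(
        account="ledgerflow", user="CDC_SVC", password="***",   # hypothetical credentials
        warehouse="MIGRATION_WH", database="LEDGERFLOW",
    )
    try:
        conn.cursor().execute(MERGE_PAYMENTS)
    finally:
        conn.close()
```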
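Table-level reconciliation can start with row counts and extend to row-level checks. A minimal count-comparison sketch; connection details and the table list are hypothetical:

```python
import psycopg2
import snowflake.connector

TABLES = ["payments", "invoices", "customers"]   # illustrative subset of the 180 tables

def reconcile_counts() -> list[tuple[str, int, int]]:
    """Return (table, postgres_count, snowflake_count) for every mismatched table."""
    pg = psycopg2.connect(host="ledgerflow-rds.example.internal", dbname="ledgerflow")
    sf = snowflake.connector.connect(account="ledgerflow", user="RECON_SVC", password="***")
    mismatches = []
    try:
        for table in TABLES:
            with pg.cursor() as cur:
                cur.execute(f"SELECT count(*) FROM {table}")
                src_count = cur.fetchone()[0]
            tgt_count = sf.cursor().execute(
                f"SELECT count(*) FROM ANALYTICS.{table.upper()}"
            ).fetchone()[0]
            if src_count != tgt_count:
                mismatches.append((table, src_count, tgt_count))
    finally:
        pg.close()
        sf.close()
    return mismatches
```

Counts catch missing or duplicated rows; for row-level confidence, the same loop can compare hashes (for example, MD5 over a stable serialization of each row) per primary-key range and flag any divergent range for selective reprocessing.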
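Since Airflow is already in place, dependencies can be expressed as a per-table DAG of extract, load, validate, and promote tasks. A sketch assuming a recent Airflow 2.x release; the DAG id and task callables are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_to_s3(**context): ...        # unload a table (or key range) to S3
def load_to_snowflake(**context): ...    # COPY the staged files into Snowflake
def validate_counts(**context): ...      # run reconciliation checks; raise on mismatch
def promote_table(**context): ...        # swap the validated table into the curated schema

with DAG(
    dag_id="payments_migration",
    start_date=datetime(2024, 1, 1),
    schedule=None,                       # triggered on demand during the migration window
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    load = PythonOperator(task_id="load_to_snowflake", python_callable=load_to_snowflake)
    validate = PythonOperator(task_id="validate_counts", python_callable=validate_counts)
    promote = PythonOperator(task_id="promote_table", python_callable=promote_table)

    # A failed validation blocks promotion, so unverified data never reaches consumers.
    extract >> load >> validate >> promote
```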
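Selective reprocessing can reuse the same DAG by reading the target table and date range from the run configuration. A sketch of a task callable, assuming the hypothetical dag_run.conf keys shown:

```python
def reprocess_scope(**context) -> dict:
    """Resolve which table and date range this run should rebuild.

    Triggered ad hoc, e.g.:
      airflow dags trigger payments_migration \
        --conf '{"table": "payments", "start_date": "2024-01-01", "end_date": "2024-01-07"}'
    """
    conf = context["dag_run"].conf or {}
    return {
        "table": conf.get("table", "payments"),
        "start_date": conf.get("start_date"),   # None means the full history
        "end_date": conf.get("end_date"),
    }
```

Downstream extract and load tasks can then scope their queries and COPY statements to that table and date range, so a single bad partition can be rebuilt without rerunning the whole migration.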
Constraints
- AWS is the required cloud environment; existing tooling includes Airflow and S3.
- Budget for incremental migration infrastructure is $30K/month.
- Source database is customer-facing; long-running locks and full-table scans during business hours are not allowed.
- Compliance: SOC 2 and PCI scope reduction; sensitive columns must be encrypted or tokenized before broad analytics access.