Context
ShopSphere, a global marketplace processing orders, payments, and inventory updates, currently runs on a single PostgreSQL 13 primary with read replicas. The monolithic database has reached write IOPS limits, vacuum contention is increasing, and maintenance windows are no longer acceptable. Leadership wants a zero-downtime migration to a sharded architecture while preserving existing ETL pipelines, CDC feeds, and downstream analytics.
You are asked to design the migration pipeline and cutover plan. Focus on how data is copied, validated, dual-written or replicated, backfilled, and switched over without interrupting application traffic or breaking downstream consumers.
Scale Requirements
- Database size: 28 TB total, 11 TB hot data
- Traffic: 180K reads/sec, 22K writes/sec peak
- Tables: 1,200 total; 75 latency-critical OLTP tables
- CDC volume: 350 MB/sec peak WAL generation
- Cutover target: zero planned downtime, < 250 ms added write latency during migration
- Data correctness: no lost committed writes; duplicate writes must be detectable and reconciled
- Retention: 7 years for financial records, GDPR deletion within 72 hours
Requirements
- Design a phased migration from a single PostgreSQL cluster to a sharded architecture using a deterministic shard key.
- Support historical backfill of existing data and continuous change capture for ongoing writes.
- Keep existing batch and streaming consumers operational during migration.
- Define how application reads/writes are routed before, during, and after cutover.
- Implement idempotent replay, consistency checks, and shard-level reconciliation.
- Describe rollback strategy if one or more shards fall behind or serve inconsistent data.
- Include orchestration, observability, and runbook-driven cutover controls.
Constraints
- AWS-only deployment; existing stack includes PostgreSQL, Debezium, Kafka, Airflow, and S3
- Team cannot rewrite all services at once; migration must support partial table-by-table onboarding
- PCI and GDPR data must remain encrypted in transit and at rest
- Incremental budget cap: $40K/month during migration
- Some tables require cross-shard lookups, but distributed transactions should be minimized