Context
You’re on the data platform team at a fintech payments company processing 8–12M card transactions/day. The core ledger is stored in Postgres and feeds downstream risk models, chargeback workflows, and regulatory reporting. A schema change is needed: you must migrate from a legacy transactions.status string field to a normalized transaction_state table plus a new transactions.state_id foreign key. The migration must be executed in production with near-zero downtime, and mistakes could cause failed authorizations, incorrect customer balances, or inaccurate compliance reports.
Core Question
Describe your experience with database migrations in a production environment. In your answer, cover the following:
- Planning & rollout strategy: How do you design a migration plan that minimizes risk (e.g., expand/contract pattern, feature flags, phased rollout)?
- Backfill approach: How do you backfill historical data safely and efficiently (batching, idempotency, retry strategy, avoiding long locks)?
- Application compatibility: How do you handle dual reads / dual writes during the transition, and how do you ensure old and new code paths remain consistent?
- Validation & monitoring: What SQL checks do you run to validate correctness (row counts, mismatch detection, referential integrity, sampling, reconciliation)?
- Rollback & incident response: What does a rollback plan look like, and what signals/metrics would trigger it?
Scope Guidance (what a strong answer includes)
- Concrete examples of online schema changes (adding nullable columns, backfilling, adding constraints later, swapping reads, then dropping old columns).
- Discussion of operational concerns: lock timeouts, index creation strategy (concurrent), long-running transactions, replication lag, and deployment sequencing.
- At least 2–3 example SQL validation queries you would run during/after the migration (e.g., mismatch counts, orphan detection, completeness over time windows).
- Trade-offs: speed vs safety, correctness vs availability, and how you communicate risk to stakeholders (SRE, compliance, product).