You’re on the data platform team at a fintech payments company processing 8–12M card transactions/day. The core ledger is stored in Postgres and feeds downstream risk models, chargeback workflows, and regulatory reporting. A schema change is needed: you must migrate from a legacy transactions.status string field to a normalized transaction_state table plus a new transactions.state_id foreign key. The migration must be executed in production with near-zero downtime, and mistakes could cause failed authorizations, incorrect customer balances, or inaccurate compliance reports.
Describe your experience with database migrations in a production environment. In your answer, cover the following:
A safe production pattern where you first expand the schema (add new tables/columns), run backfills and dual-writes, then contract (switch reads and remove old schema). It reduces downtime by avoiding blocking changes and allowing gradual cutover.
ALTER TABLE transactions ADD COLUMN state_id BIGINT NULL;
-- Later, after backfill and validation:
ALTER TABLE transactions ALTER COLUMN state_id SET NOT NULL;
Backfills should be restartable without corrupting data. Batching limits lock duration and reduces replication lag; idempotency is often achieved by updating only rows missing the new value and using deterministic mappings.
WITH batch AS (
SELECT id
FROM transactions
WHERE state_id IS NULL
ORDER BY id
LIMIT 5000
)
UPDATE transactions t
SET state_id = s.id
FROM transaction_state s
JOIN batch b ON b.id = t.id
WHERE t.status = s.legacy_status
AND t.state_id IS NULL;
During the transition, the application writes to both old and new representations (or writes new and derives old) while reads may be toggled via feature flags. This prevents breaking older services and enables gradual rollout and rollback.
SELECT
COUNT(*) AS mismatches
FROM transactions t
JOIN transaction_state s ON s.id = t.state_id
WHERE t.status <> s.legacy_status;
Production migrations require continuous validation: completeness (no NULLs), correctness (no mismatches), and integrity (no orphan foreign keys). Checks are often scoped to recent time windows to catch regressions quickly.
SELECT COUNT(*) AS orphan_rows
FROM transactions t
LEFT JOIN transaction_state s ON s.id = t.state_id
WHERE t.state_id IS NOT NULL
AND s.id IS NULL;
Large tables require avoiding long exclusive locks. Common techniques include creating indexes concurrently, adding constraints as NOT VALID then validating, and setting lock timeouts to fail fast rather than stall production.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_transactions_state_id ON transactions(state_id);
ALTER TABLE transactions
ADD CONSTRAINT fk_transactions_state
FOREIGN KEY (state_id) REFERENCES transaction_state(id)
NOT VALID;
ALTER TABLE transactions VALIDATE CONSTRAINT fk_transactions_state;