You are responsible for a distributed service that accepts writes in one region and replicates them to several downstream stores. Clients retry aggressively, network partitions happen, and two writers can update the same record at nearly the same time. A recent incident showed that the same logical update was applied twice in one store and lost in another. You need to prevent inconsistent state without turning the system into a single-region bottleneck.
How would you design the write path so that concurrent updates, retries, and partial failures do not corrupt state? What guarantees would you provide, and how would you detect and recover from divergence when it still happens?
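For concreteness, the duplicate-apply half of the incident can be reproduced with a small sketch like the one below. It is only an illustration under stated assumptions, not the actual service: the `Store` class, its `apply_naive` and `apply_idempotent` methods, the counter-style record, and the `update-001` key are all hypothetical names invented for this example. It shows a retried update being applied twice when the store does no deduplication, and once when each logical update carries an idempotency key.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical downstream store holding one integer value per record."""
    values: dict = field(default_factory=dict)
    seen_keys: set = field(default_factory=set)

    def apply_naive(self, record_id: str, delta: int) -> None:
        # No deduplication: every delivery changes state, including retries.
        self.values[record_id] = self.values.get(record_id, 0) + delta

    def apply_idempotent(self, key: str, record_id: str, delta: int) -> bool:
        # Skip any delivery whose idempotency key has already been applied.
        if key in self.seen_keys:
            return False
        self.seen_keys.add(key)
        self.values[record_id] = self.values.get(record_id, 0) + delta
        return True


naive = Store()
deduped = Store()

# One logical update ("add 5 to record r1") delivered twice due to a client retry.
for _ in range(2):
    naive.apply_naive("r1", 5)
    deduped.apply_idempotent("update-001", "r1", 5)

print(naive.values["r1"], deduped.values["r1"])
```

This covers only retries against a single store. It does not address the lost update in the other store or two writers racing on the same record, which are the parts of the question that remain open.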