You are responsible for a control plane that manages configuration and access for multiple production data services. It runs across two regions and must stay available during node loss, zone loss, and partial network partitions. A recent incident caused one region to serve stale state after a failover, and a second issue showed that some clients retried aggressively enough to amplify the outage.