You are responsible for a customer-facing service that must stay available during regional outages, deploy failures, and partial infrastructure loss. The service depends on a primary database, internal APIs, and shared network paths, and your team has seen a recent failover expose gaps in recovery timing and data consistency. You need to make the system resilient without creating a split-brain or unsafe recovery path.
How would you design high availability and disaster recovery for this service, and what trade-offs would you make between availability, consistency, and operational complexity? Walk me through how you would prove the design works under failure, including how you would detect a bad failover and safely roll back.