Design Reliable, Fault-Tolerant Service Recovery

Scenario

You are responsible for a production service that coordinates internal requests across multiple dependencies and regions. A recent incident caused partial unavailability after one dependency degraded and traffic failed over unevenly. You need to make the system resilient enough that a single regional or dependency failure does not cascade into a broader outage.

Question

How would you ensure the service is reliable and fault tolerant under dependency failures, regional loss, and partial network degradation? Explain what you would change in the architecture, how you would detect that failover is actually working, and what you would do when the system cannot degrade gracefully.
