You are responsible for a production service that coordinates internal requests across multiple dependencies and regions. A recent incident caused partial unavailability after one dependency degraded and traffic failed over unevenly. You need to make the system resilient enough that a single regional or dependency failure does not cascade into a broader outage.
How would you ensure the service is reliable and fault tolerant under dependency failures, regional loss, and partial network degradation? Explain what you would change in the architecture, how you would detect that failover is actually working, and what you would do when the system cannot degrade gracefully.