You own a platform that runs user-submitted analysis workloads, exposes APIs for orchestration, and stores sensitive research data and metadata. Usage has grown quickly, and recent incidents showed that traffic spikes, dependency failures, and misconfigured deployments can each degrade availability. As it scales, the platform must preserve strong access controls, auditability, and data protection. You need an approach that improves reliability without weakening the security posture.
How would you design the system so that it scales safely and remains reliable under failure, and how would you decide which components fail open, fail closed, or degrade gracefully? Be explicit about the security controls, the operational trade-offs, and how you would verify that the design works in production.