

A critical production security service has suffered a complex failure: authentication requests are intermittently timing out, some policy decisions are inconsistent across regions, and alert volume in your Nokia NetGuard Cybersecurity Dome has spiked without a single obvious root cause. The incident is affecting enterprise customers, and leadership wants service stabilized quickly and the underlying cause understood so that the issue does not recur. You need to investigate in an environment where multiple recent changes overlap, telemetry is noisy, and every mitigation carries some risk of customer impact.
How would you investigate the underlying cause of this failure, decide what immediate actions to take, and determine whether to roll back, contain, or continue operating while you validate the root cause?