You own a production service deployed on Kubernetes, and a new release has just been rolled out through the CI/CD pipeline. Within minutes, customer-facing errors increase, some pods enter CrashLoopBackOff, and downstream dependencies begin timing out. The deployment included both application changes and infrastructure configuration updates, and you need to determine whether this is a bad build, a runtime misconfiguration, a dependency issue, or a security control blocking the release.
How would you troubleshoot this production deployment failure end to end, decide whether to roll back, and restore service safely? Be explicit about how you separate application, infrastructure, and security causes while preserving evidence and limiting blast radius.
New container image and Kubernetes manifest were deployed together
Pods are restarting and some never become ready
Customer-facing errors increased immediately after rolloutDownstream timeouts may indicate network, identity, or secret access issues
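As a starting point for separating causes, the sketch below buckets each container's state by the reason Kubernetes reports for it. It assumes the pod list has already been exported as JSON (for example with `kubectl get pods -o json`) and loaded into a dictionary. The function names `classify_container` and `triage`, the bucket labels, and the reason-to-bucket mapping are all hypothetical and heuristic; they are not part of any Kubernetes API and only narrow down where to look first.

```python
# Heuristic mapping from Kubernetes container state reasons to a likely cause
# bucket. The bucket names and the mapping are illustrative assumptions.
REASON_BUCKETS = {
    "ImagePullBackOff": "build-or-registry",
    "ErrImagePull": "build-or-registry",
    "CreateContainerConfigError": "config-or-secret",
    "OOMKilled": "resources",
}


def classify_container(status):
    """Return a cause bucket for one entry of status.containerStatuses."""
    waiting = (status.get("state") or {}).get("waiting") or {}
    last_term = (status.get("lastState") or {}).get("terminated") or {}

    # Specific reasons win over the generic CrashLoopBackOff wrapper.
    for reason in (waiting.get("reason"), last_term.get("reason")):
        if reason in REASON_BUCKETS:
            return REASON_BUCKETS[reason]

    # A crash loop or non-zero exit without a more specific reason points
    # at the application or its runtime configuration.
    crashed = last_term.get("exitCode", 0) != 0
    if waiting.get("reason") == "CrashLoopBackOff" or crashed:
        return "application-or-runtime"

    # Running but failing readiness: often a dependency or probe problem.
    if not status.get("ready", False):
        return "not-ready"
    return "healthy"


def triage(pod_list):
    """Count containers per cause bucket across a pod list dictionary."""
    counts = {}
    for pod in pod_list.get("items", []):
        for cs in pod.get("status", {}).get("containerStatuses", []):
            bucket = classify_container(cs)
            counts[bucket] = counts.get(bucket, 0) + 1
    return counts
```

A result dominated by one bucket suggests which layer to inspect first. A mixed result fits the scenario's combined application and configuration change, and is an argument for rolling back before investigating further.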