You have just been paged because a software issue has halted production on a manufacturing line. The line-control application, PLC-facing integration service, and operator HMI are all in the path for releasing work, and operators report that recent jobs are stuck in an error state after a deployment. You need to restore production quickly without making the situation worse or introducing unsafe commands to plant-floor equipment.
How would you troubleshoot this incident from first signal to stable recovery? Explain how you would separate a bad release from an infrastructure, identity, or integration failure, and how you would decide between rollback, hotfix, or partial isolation while keeping the environment safe.