You own a fleet of network-connected embedded appliances that use secure boot, signed firmware, and a management plane reachable over a dedicated interface. After a routine firmware rollout, a subset of devices begins rebooting repeatedly, some never establish management connectivity, and a few that do come online report attestation mismatches. Operations is unsure whether this is a reliability bug, a bad rollout, or an active compromise.
How would you diagnose this failure end to end, from first triage through root-cause isolation, while protecting evidence and reducing the risk of making a potentially compromised fleet state worse? Be explicit about the signals, trust boundaries, and concrete controls you would rely on during diagnosis.
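As an added illustration that is not part of the original question, the script below shows one concrete first-triage step such an answer would rely on: checking each device's attested measurement against a golden table of expected measurements per firmware release, and bucketing devices so that "didn't take the update" is separated from "claims the new build but attests something else". All data is constructed; the version strings, the `measure` stand-in (SHA-256 of an image blob), the report fields and the action table are assumptions for the example, not any product's real attestation format.

```python
"""Triage of firmware attestation reports against a golden-measurement table.

Added illustration with constructed data; not from any real fleet or product.
Assumes each device reports the firmware version it believes it runs plus the
measurement its boot chain attested, and that the expected measurement for each
release is known from the build pipeline (both assumptions for this example).
"""
import hashlib
from collections import Counter
from dataclasses import dataclass


def measure(image: bytes) -> str:
    """Stand-in for a boot measurement: SHA-256 of a firmware image (constructed)."""
    return hashlib.sha256(image).hexdigest()


# Constructed golden table: release version -> expected measurement.
GOLDEN = {
    "fw-1.4.2": measure(b"constructed firmware image 1.4.2"),
    "fw-1.5.0": measure(b"constructed firmware image 1.5.0"),
}
TARGET_VERSION = "fw-1.5.0"  # the release the rollout was supposed to install


@dataclass(frozen=True)
class Report:
    device_id: str
    claimed_version: str   # what the device's own software says it runs
    measurement: str       # what the boot chain attested
    mgmt_reachable: bool
    reboots_last_hour: int


def classify(r: Report) -> str:
    """Bucket one report into a triage class (short label)."""
    attested_version = next((v for v, m in GOLDEN.items() if m == r.measurement), None)
    if attested_version is None:
        # Measurement matches no known release: the rollout alone cannot explain it.
        return "unknown-measurement"
    if attested_version != r.claimed_version:
        # Boot chain attests one release while the device claims another.
        return "version-claim-inconsistent"
    if attested_version != TARGET_VERSION:
        return "stale-or-rolled-back"
    if r.reboots_last_hour >= 3 or not r.mgmt_reachable:
        return "target-but-unstable"
    return "healthy"


# Constructed action table: triage class -> first response.
ACTION = {
    "healthy": "no action",
    "stale-or-rolled-back": "hold rollout; check update path and A/B slot logs",
    "target-but-unstable": "collect boot logs; compare with healthy peers on same build",
    "version-claim-inconsistent": "isolate management access; preserve report; investigate",
    "unknown-measurement": "isolate management access; preserve report; investigate",
}


def triage(reports):
    """Return ({class: [device_ids]}, Counter of unknown measurements)."""
    buckets = {}
    unknown = Counter()
    for r in reports:
        c = classify(r)
        buckets.setdefault(c, []).append(r.device_id)
        if c == "unknown-measurement":
            unknown[r.measurement] += 1
    return buckets, unknown


# Constructed sample reports covering each bucket.
SAMPLE = [
    Report("dev-01", "fw-1.5.0", GOLDEN["fw-1.5.0"], True, 0),
    Report("dev-02", "fw-1.5.0", GOLDEN["fw-1.5.0"], False, 7),
    Report("dev-03", "fw-1.4.2", GOLDEN["fw-1.4.2"], True, 0),
    Report("dev-04", "fw-1.5.0", GOLDEN["fw-1.4.2"], True, 0),
    Report("dev-05", "fw-1.5.0", measure(b"constructed unexpected image"), True, 1),
    Report("dev-06", "fw-1.5.0", measure(b"constructed unexpected image"), True, 2),
]

if __name__ == "__main__":
    buckets, unknown = triage(SAMPLE)
    for cls, ids in buckets.items():
        print(f"{cls:28s} {ids}  -> {ACTION[cls]}")
    print("distinct unknown measurements:", len(unknown))
```

The count of distinct unknown measurements is the signal worth reading first: one value shared by many devices points to a single common artifact reaching the fleet (a rollout or build-pipeline question), while many distinct values point to per-device divergence that a rollout would not produce. Devices in the two "isolate" buckets are kept off the management plane and their reports retained as-is, so that later analysis works from the original attestation data rather than from a device that has been rebooted or re-flashed in the meantime.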