Nimbus builds a B2B workflow platform with 120 engineers across 10 squads. Over the last two quarters, customer-reported incidents rose from 14 to 23 per quarter and roadmap delivery slipped from 86% to 71% of committed work, so engineering leadership wants a clear KPI framework for operational health.
The CTO says teams are tracking too many disconnected metrics: deployment frequency ranges from 3 to 18 deploys per week by squad, median PR cycle time increased from 18 to 31 hours, change failure rate rose from 9% to 15%, and mean time to restore (MTTR) increased from 52 to 95 minutes. At the same time, voluntary engineer attrition remains low at 6% annually and quarterly engagement survey scores are stable at 7.8/10, which creates confusion about whether engineering operations are actually healthy.
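The four delivery metrics the CTO cites are the standard DORA-style measures, and all of them are computable from event timestamps. A minimal sketch of those computations follows; the in-memory rows are hypothetical examples standing in for warehouse query results, and the numbers are chosen only for illustration:

```python
from datetime import datetime
from statistics import median, mean

# Hypothetical rows mirroring the deployments, pull_requests, and
# incidents tables; in practice these come from the data warehouse.
deployments = [
    {"deployed_at": datetime(2024, 4, 1), "rollback_flag": False},
    {"deployed_at": datetime(2024, 4, 3), "rollback_flag": True},
    {"deployed_at": datetime(2024, 4, 8), "rollback_flag": False},
]
pull_requests = [
    {"opened_at": datetime(2024, 4, 1, 9), "merged_at": datetime(2024, 4, 2, 16)},
    {"opened_at": datetime(2024, 4, 2, 10), "merged_at": datetime(2024, 4, 3, 12)},
]
incidents = [
    {"started_at": datetime(2024, 4, 5, 10, 0), "resolved_at": datetime(2024, 4, 5, 11, 35)},
]

def dora_metrics(deployments, pull_requests, incidents, weeks):
    """Compute the four DORA-style metrics over a reporting window."""
    deploy_freq = len(deployments) / weeks  # deploys per week
    cycle_hours = median(                   # open -> merge, in hours
        (pr["merged_at"] - pr["opened_at"]).total_seconds() / 3600
        for pr in pull_requests
    )
    cfr = sum(d["rollback_flag"] for d in deployments) / len(deployments)
    mttr_min = mean(                        # start -> resolve, in minutes
        (i["resolved_at"] - i["started_at"]).total_seconds() / 60
        for i in incidents
    )
    return {
        "deploy_frequency_per_week": deploy_freq,
        "median_pr_cycle_hours": cycle_hours,
        "change_failure_rate": cfr,
        "mttr_minutes": mttr_min,
    }

print(dora_metrics(deployments, pull_requests, incidents, weeks=1))
```

Tracking these four together matters because they trade off against each other: a squad can inflate deployment frequency while its change failure rate and MTTR quietly worsen, which is consistent with the pattern Nimbus is seeing.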
You are asked to define the small set of metrics you would rely on most heavily, explain how they fit together, and show how you would diagnose the recent deterioration.
Available tables:
deployments: deployment_id, service_id, squad_id, deployed_at, status, rollback_flag
pull_requests: pr_id, repo_id, squad_id, opened_at, first_review_at, merged_at, lines_changed
incidents: incident_id, service_id, severity, started_at, resolved_at, root_cause
sprint_commitments: squad_id, sprint_id, committed_story_points, completed_story_points
eng_survey: engineer_id, quarter, engagement_score, intent_to_stay