You manage an engineering team for a B2B infrastructure software product, and the team presents a weekly dashboard covering deployment frequency, change failure rate, mean time to recovery, incident count, and service adoption. Over the last 8 weeks, deployment frequency improved from 18 to 29 deployments per week, but change failure rate rose from 6% to 11%, and 30-day adoption of a newly released observability feature has stayed flat at 24% despite two major releases. Leadership feels the team is “reporting metrics” but not changing decisions or behavior based on them, and retrospectives keep repeating the same issues. You need a metric system that helps the team learn, not just report status.
How would you redesign the way the team uses these metrics so they drive learning, diagnosis, and better engineering decisions rather than passive reporting? What metrics, review structure, and feedback loops would you put in place?