At Datadog, the internal Site Reliability Engineering (SRE) team is struggling with a slow and inconsistent incident triage process for customer-facing infrastructure alerts. Over the last two quarters, the average time to assign an incident owner has risen from 8 minutes to 27 minutes, lengthening customer impact windows and prompting repeated escalations from support and sales. You are the program manager assigned to improve the workflow and deliver a measurable process change within one quarter.
The working team includes 10 people: 4 SREs, 2 backend engineers, 1 analytics engineer, 1 support operations lead, 1 product manager, and you. The VP of Engineering wants visible improvement before the next enterprise renewal cycle in 12 weeks.
The VP of Engineering wants faster incident response and fewer executive escalations. The SRE manager wants a sustainable workflow that does not increase after-hours burden. The Support Director wants clearer handoffs and status visibility for customer-facing teams. The analytics engineer wants time to instrument better reporting, and the backend engineers want to reduce manual steps through automation.