Improve Incident Triage Workflow

Project Context

At Datadog, the internal Site Reliability Engineering (SRE) team is struggling with a slow and inconsistent incident triage process for customer-facing infrastructure alerts. Over the last two quarters, average time to assign an incident owner has risen from 8 minutes to 27 minutes, causing longer customer impact windows and repeated escalations from support and sales. You are the program manager assigned to improve the workflow and deliver a measurable process change within one quarter.

The working team includes 10 people: 4 SREs, 2 backend engineers, 1 analytics engineer, 1 support operations lead, 1 product manager, and you. The VP of Engineering wants visible improvement before the next enterprise renewal cycle in 12 weeks.

Key Stakeholders

The VP of Engineering wants faster incident response and fewer executive escalations. The SRE manager wants a sustainable workflow that does not increase after-hours burden. The Support Director wants clearer handoffs and status visibility for customer-facing teams. The analytics engineer wants time to instrument better reporting, while backend engineers are pushing to reduce manual steps through automation.

Constraints

Timeline: 12 weeks
Budget: $95,000 total
Team capacity: no new headcount; backend engineers available only 40% each
Existing tooling must remain in PagerDuty, Jira, and Slack; no platform replacement allowed
At least 1 engineer must stay allocated to production support at all times

Complications

Incident severity definitions differ across SRE, Support, and Engineering, causing routing confusion.
A major infrastructure migration is scheduled in Week 7, reducing SRE capacity for two weeks.
The VP of Sales has asked for customer-visible improvements by Week 8, earlier than the full rollout target.

Deliverables

Create a 12-week execution plan to improve the incident triage workflow.
Define the process changes you would prioritize first, including trade-offs and scope cuts.
Propose stakeholder alignment steps to resolve conflicting severity definitions and ownership gaps.
Identify the top risks to delivery and adoption, with mitigation plans.
Define success metrics and a rollout approach, including how you would validate that the new workflow is working.

Project Context

Key Stakeholders

Constraints

Timeline: 12 weeks
Budget: $95,000 total
Team capacity: no new headcount; backend engineers available only 40% each
Existing tooling must remain in PagerDuty, Jira, and Slack; no platform replacement allowed
At least 1 engineer must stay allocated to production support at all times

Complications

Incident severity definitions differ across SRE, Support, and Engineering, causing routing confusion.
A major infrastructure migration is scheduled in Week 7, reducing SRE capacity for two weeks.
The VP of Sales has asked for customer-visible improvements by Week 8, earlier than the full rollout target.

Deliverables

Create a 12-week execution plan to improve the incident triage workflow.
Define the process changes you would prioritize first, including trade-offs and scope cuts.
Propose stakeholder alignment steps to resolve conflicting severity definitions and ownership gaps.
Identify the top risks to delivery and adoption, with mitigation plans.
Define success metrics and a rollout approach, including how you would validate that the new workflow is working.

Project Context

Key Stakeholders

Constraints

Timeline: 12 weeks
Budget: $95,000 total
Team capacity: no new headcount; backend engineers available only 40% each
Existing tooling must remain in PagerDuty, Jira, and Slack; no platform replacement allowed
At least 1 engineer must stay allocated to production support at all times

Complications

Incident severity definitions differ across SRE, Support, and Engineering, causing routing confusion.
A major infrastructure migration is scheduled in Week 7, reducing SRE capacity for two weeks.
The VP of Sales has asked for customer-visible improvements by Week 8, earlier than the full rollout target.

Deliverables

Create a 12-week execution plan to improve the incident triage workflow.
Define the process changes you would prioritize first, including trade-offs and scope cuts.
Propose stakeholder alignment steps to resolve conflicting severity definitions and ownership gaps.
Identify the top risks to delivery and adoption, with mitigation plans.
Define success metrics and a rollout approach, including how you would validate that the new workflow is working.

Project Context

Key Stakeholders

Constraints

Timeline: 12 weeks
Budget: $95,000 total
Team capacity: no new headcount; backend engineers available only 40% each
Existing tooling must remain in PagerDuty, Jira, and Slack; no platform replacement allowed
At least 1 engineer must stay allocated to production support at all times

Complications

Incident severity definitions differ across SRE, Support, and Engineering, causing routing confusion.
A major infrastructure migration is scheduled in Week 7, reducing SRE capacity for two weeks.
The VP of Sales has asked for customer-visible improvements by Week 8, earlier than the full rollout target.

Deliverables

Create a 12-week execution plan to improve the incident triage workflow.
Define the process changes you would prioritize first, including trade-offs and scope cuts.
Propose stakeholder alignment steps to resolve conflicting severity definitions and ownership gaps.
Identify the top risks to delivery and adoption, with mitigation plans.
Define success metrics and a rollout approach, including how you would validate that the new workflow is working.

Interview Guides

Project Context

Key Stakeholders

Constraints

Complications

Deliverables

Improve Incident Triage Workflow

Project Context

Key Stakeholders

Constraints

Complications

Deliverables

Improve Incident Triage Workflow

Project Context

Key Stakeholders

Constraints

Complications

Deliverables

Improve Incident Triage Workflow

Project Context

Key Stakeholders

Constraints

Complications

Deliverables