You are the engineering manager owning a rewrite of a high-volume alerting workflow in a supply chain visibility platform. The current workflow powers exception detection and case creation used daily by enterprise customers, but it has become brittle: incident volume is rising, deploys are slow, and two recent customer escalations were traced to duplicate alerts and delayed processing. Leadership wants a customer-visible launch of the new workflow in 10 weeks to support a committed renewal. Meanwhile, your team is carrying a backlog of reliability work in the existing system, and one senior engineer is the only person who understands the current deduplication logic. Product is pushing for parity plus three new capabilities at launch, but platform engineering is warning that the event ingestion path will not meet the target load without foundational cleanup.

| Detail | Value |
|---|---|
| Engineers | 6 backend, 2 frontend, 1 QA |
| Current alert volume | 18M events/day |
| Enterprise customers affected | 42 |
| Deadline | 10 weeks |
| Availability target | 99.9% during rollout |
| Budget for external support | $75K |
| New launch asks | Parity + 3 new features |
| Key person risk | 1 senior engineer owns dedupe logic |

How would you plan and execute this project so you can meet the near-term launch need without creating unacceptable long-term engineering risk? Explain how you would make trade-offs on scope, sequence the work, manage stakeholders, and decide what must be fixed now versus deferred.