You are the engineering manager for a tax calculation platform after a failed rollout of changes to the tax determination pipeline caused incorrect tax treatment for a subset of invoices in production. The issue was caught within hours and mitigated through a rollback, but 18 customers were affected, the support team is handling escalations, and leadership wants confidence that the same class of failure will not recur before the next planned release in three weeks. The incident involved multiple teams across tax engine, integrations, and customer operations, and there is already tension because one senior engineer believes the root cause was obvious while product leadership is focused on restoring customer trust quickly. You need to run a postmortem that produces real learning and concrete follow-through rather than a blame session or a document that gets ignored.
| Detail | Value |
|---|---|
| Affected customers | 18 |
| Time to detect | 2 hours |
| Time to rollback | 45 minutes |
| Teams involved | 3 engineering teams + support + customer operations |
| Engineers available for follow-up | 6 |
| Next scheduled release | 3 weeks |
| Compliance review required | Yes, for customer-facing remediation |
| Executive update deadline | 48 hours |
How would you structure and execute the postmortem so the team actually learns from the failure and the resulting actions are prioritized, owned, and completed without derailing the next release?