Project Background
A large retail advertiser is sending server-side purchase events to the Meta Conversions API to improve attribution for Facebook and Instagram campaigns. Over the last 72 hours, the integration's request timeout rate has risen from 1% to 18%. The result is delayed event delivery, underreported conversions in Ads Manager, and pressure from the client’s media team ahead of a major weekend campaign.
You are the Solutions Architect responsible for driving the recovery plan across a cross-functional team of 8: 3 client engineers, 2 Meta partner engineers, 1 Technical Account Manager, 1 Product Specialist, and you. The client expects a clear path to restore performance within 10 business days without pausing campaign spend.
Key Stakeholders
- Client VP of Marketing wants attribution restored before the campaign launch and is pushing for the fastest possible fix.
- Client Engineering Manager wants root cause isolation before any production changes.
- Meta Account Team wants to protect campaign performance and avoid escalations.
- Meta Partner Engineering wants changes limited to supported Conversions API patterns and stable rollout practices.
Constraints
- Budget for external engineering support is capped at $40,000.
- No new headcount; only the existing 8-person team is available.
- The client processes 12M purchase-related events/day across 6 markets.
- Any production change must keep event loss under 0.5%.
- The weekend campaign starts in 14 calendar days.
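The constraints above imply a concrete daily error budget. The following is a rough sketch of that arithmetic, not a measured result: it assumes one event per request (batching would change the request counts) and an even split across markets, neither of which the brief states.

```python
# Figures taken from the brief; the one-event-per-request and even-split
# assumptions are illustrative only.
EVENTS_PER_DAY = 12_000_000
MARKETS = 6
MAX_LOSS_BPS = 50          # 0.5% event-loss ceiling, in basis points
TIMEOUT_BPS_BEFORE = 100   # 1% timeout rate before the incident
TIMEOUT_BPS_NOW = 1_800    # 18% timeout rate now
SECONDS_PER_DAY = 24 * 60 * 60


def share(events: int, bps: int) -> int:
    """Number of events covered by a rate expressed in basis points."""
    return events * bps // 10_000


loss_budget_per_day = share(EVENTS_PER_DAY, MAX_LOSS_BPS)
timeouts_before = share(EVENTS_PER_DAY, TIMEOUT_BPS_BEFORE)
timeouts_now = share(EVENTS_PER_DAY, TIMEOUT_BPS_NOW)
avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
events_per_market = EVENTS_PER_DAY // MARKETS  # assumes an even split

print(f"Loss budget:        {loss_budget_per_day:,} events/day")
print(f"Timeouts at 1%:     {timeouts_before:,} events/day")
print(f"Timeouts at 18%:    {timeouts_now:,} events/day")
print(f"Average throughput: {avg_events_per_second:.1f} events/s")
print(f"Per market:         {events_per_market:,} events/day")
```

Under these assumptions, the current timeout volume is about 36 times the daily loss budget, so timed-out events have to be retried or queued rather than dropped if the 0.5% ceiling is to hold.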
Complications
- The client recently added a middleware layer that deduplicates browser Pixel and Conversions API events, but its documentation is incomplete.
- Their infrastructure team has a freeze on major network changes for the next 7 days.
- The VP of Marketing is asking whether the team can temporarily increase request timeouts and ship immediately.
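One alternative to simply raising timeouts is a bounded retry with jittered backoff. The sketch below is illustrative only: `send` is a hypothetical callable standing in for whatever client the middleware uses, and the attempt count and delays are placeholder values. It assumes each event already carries an `event_id` and that a retry resends the event unchanged, so deduplication against the matching Pixel event (which keys on event name and event ID) is not disturbed.

```python
import random
import time
from typing import Any, Callable, Dict, List


def backoff_delays(retries: int, base: float = 0.5, cap: float = 8.0,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Full-jitter delays: each is uniform in [0, min(cap, base * 2**n))."""
    return [rng() * min(cap, base * 2 ** n) for n in range(retries)]


def send_with_retry(send: Callable[[Dict[str, Any]], None],
                    event: Dict[str, Any],
                    max_attempts: int = 4,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Send one event, retrying on TimeoutError; return attempts used.

    `send` is a hypothetical transport function that raises TimeoutError
    when the request times out. The event dict is never modified, so the
    same event_id is reused on every attempt.
    """
    if not event.get("event_id"):
        raise ValueError("event_id is required so retries stay dedup-safe")
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        try:
            send(event)
            return attempt + 1
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # caller should queue the event, not drop it
            sleep(delays[attempt])
    raise RuntimeError("unreachable")
```

Bounding the attempts keeps a slow endpoint from piling up in-flight requests, which a blanket timeout increase can do; events that exhaust their retries would still need a durable queue to stay inside the loss constraint.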
Your Task
- Build a 10-day execution plan to diagnose and fix the timeout issue.
- Define how you would prioritize root-cause analysis versus short-term mitigation.
- Propose a rollout and rollback plan for any fix touching production traffic.
- Identify the key risks, owners, and escalation points.
- Define success metrics for recovery before the campaign launch.
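A rollout and rollback plan is easier to agree on when the gate between stages is written down explicitly. The following is one possible sketch of such a gate, not a prescribed method. The 0.5% loss ceiling comes from the constraints; using the pre-incident 1% timeout rate as the promotion target, the `StageStats` fields, and the minimum sample size are all assumptions made for illustration.

```python
from dataclasses import dataclass

MAX_LOSS_RATE = 0.005        # from the constraints: event loss under 0.5%
TARGET_TIMEOUT_RATE = 0.01   # assumption: return to the pre-incident 1%


@dataclass(frozen=True)
class StageStats:
    """Counters for one rollout stage (hypothetical field names)."""
    sent: int        # events the stage attempted to deliver
    confirmed: int   # events eventually acknowledged by the endpoint
    timed_out: int   # requests that hit the client-side timeout


def decide(stats: StageStats, min_sample: int = 10_000) -> str:
    """Return 'rollback', 'hold', or 'promote' for a rollout stage."""
    if stats.sent < min_sample:
        return "hold"  # not enough traffic to judge either way
    loss_rate = (stats.sent - stats.confirmed) / stats.sent
    timeout_rate = stats.timed_out / stats.sent
    if loss_rate >= MAX_LOSS_RATE:
        return "rollback"  # hard constraint breached
    if timeout_rate > TARGET_TIMEOUT_RATE:
        return "hold"  # safe, but the fix has not yet proven itself
    return "promote"


# Example: 100,000 events at a canary stage, 100 lost, 500 timeouts.
print(decide(StageStats(sent=100_000, confirmed=99_900, timed_out=500)))
```

Separating the hard constraint (loss) from the recovery target (timeouts) means a stage that is safe but not yet better holds in place instead of triggering an unnecessary rollback.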