Project Context
You are the program manager embedded with Google One’s Growth & Retention team. Google One has 35M paid subscribers globally and is a material contributor to Google’s consumer subscription revenue. Your team owns the subscription lifecycle: upgrade, renewal, downgrade, cancellation, and win-back.
At 08:45 PT on a Tuesday, the executive retention dashboard used by the VP of Subscriptions flags a sudden +2.4 percentage point increase in 7-day churn for new subscribers (from 4.8% to 7.2%) starting “yesterday.” The alert is based on near-real-time event ingestion and is considered a P0 business incident because a sustained increase at this magnitude would translate to an estimated $3.5M–$5M in monthly recurring revenue (MRR) at risk.
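The arithmetic behind the alert can be sketched as a simple baseline comparison. This is an illustrative sketch only: the function names, cohort sizes, and the 1-percentage-point alert threshold are assumptions for the example, not the team's actual alerting logic.

```python
# Hypothetical sketch of the 7-day churn spike check behind the dashboard alert.
# Cohort sizes and the alert threshold below are illustrative assumptions.

def churn_rate(starts: int, churned_by_day7: int) -> float:
    """7-day churn = share of a new-subscriber cohort that cancels within 7 days."""
    return churned_by_day7 / starts if starts else 0.0

def is_spike(current: float, baseline: float, threshold_pp: float = 1.0) -> bool:
    """Alert when churn exceeds the baseline by more than threshold_pp points."""
    return (current - baseline) * 100 >= threshold_pp

baseline = churn_rate(100_000, 4_800)   # 4.8% historical baseline
today = churn_rate(100_000, 7_200)      # 7.2% from the alerting cohort
print(is_spike(today, baseline))        # True: +2.4pp clears a 1pp threshold
```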
The timing is uncomfortable: over the last 72 hours, three changes went out:
- Android Play Billing library upgrade (v5 → v6) to meet a compliance deadline.
- A new cancellation flow experiment (treatment: fewer steps; different copy) ramped from 10% to 50% of eligible users.
- A backend change to the entitlements service to reduce latency (cache layer + TTL changes).
Your cross-functional “incident pod” is:
| Function | Count | Notes |
|---|---|---|
| Backend Engineers | 4 | Own entitlements + subscription state machine |
| Android Engineers | 3 | Own Play Billing integration |
| Data Scientist | 1 | Owns churn metric definitions + anomaly detection |
| Data Engineer | 1 | Owns event pipeline + dashboard tables |
| Product Manager | 1 | Owns cancellation flow + experiment |
| Customer Support Ops | 1 | Sees ticket spikes and refund reasons |
| Finance Partner | 1 | Tracks refunds/chargebacks and revenue recognition |
You are accountable for driving the debugging process end-to-end, aligning stakeholders on trade-offs, and landing a decision (rollback, pause ramp, hotfix, or “false alarm”) within tight timelines.
Stakeholder Landscape
- VP of Subscriptions: Wants a clear answer within 4 hours: “Is this real churn or a measurement issue, and what are we doing right now?” Low tolerance for ambiguity.
- Android Eng Director: Worried about rolling back the Play Billing upgrade because it risks missing the compliance deadline and reintroducing known payment bugs.
- PM for Cancellation Flow: Believes the new flow should reduce churn; suspects the dashboard is wrong or the cohort definition changed.
- Finance: Concerned about refunds and chargebacks spiking (which can indicate failed renewals or accidental cancellations).
- Customer Support: Reports an uptick in tickets tagged “lost access” and “charged but no storage,” which could indicate entitlements issues rather than true churn.
You must manage competing narratives and avoid thrash while still moving quickly.
Constraints
- Decision SLA: You must recommend an action (pause ramp / rollback / hotfix / monitor) within 6 hours of the initial alert.
- Rollback Risk: Rolling back the entitlements cache change is low risk (server-side). Rolling back Play Billing is high risk and requires an app update for some users.
- Data Latency: The churn dashboard is built on a pipeline with 15–45 minutes latency; late-arriving events are common in some markets.
- Market Complexity: The spike appears strongest in Brazil and India, where payment methods and network reliability differ.
- Access & Privacy: You cannot access raw payment instrument data; only aggregated and tokenized identifiers are available.
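The data-latency constraint matters when validating the spike: partitions still inside the lateness window can understate renewals and overstate churn. A minimal sketch of one mitigation, assuming hourly event partitions and a 45-minute lateness budget (both assumptions for illustration):

```python
from datetime import datetime, timedelta, timezone

# Sketch (assumed pipeline semantics): with 15-45 min ingestion latency and
# late-arriving events, treat only hourly partitions older than a lateness
# budget as "complete" before trusting the churn numbers computed from them.

LATENESS_BUDGET = timedelta(minutes=45)

def complete_partitions(partition_hours, now=None):
    """Keep partitions whose close time is older than the lateness budget."""
    now = now or datetime.now(timezone.utc)
    return [p for p in partition_hours
            if now - (p + timedelta(hours=1)) >= LATENESS_BUDGET]

now = datetime(2024, 6, 4, 16, 45, tzinfo=timezone.utc)
hours = [datetime(2024, 6, 4, h, tzinfo=timezone.utc) for h in range(12, 17)]
print([p.hour for p in complete_partitions(hours, now)])  # [12, 13, 14, 15]
```

The 16:00 partition is excluded because it closed only 45 minutes ago at most, so its churn counts may still change as late events arrive from markets like Brazil and India.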
What You Need to Deliver (Candidate Tasks)
- Triage & Debug Plan (first 60 minutes): Walk through the exact steps you would take to validate the signal, isolate likely causes, and assign owners.
- Hypothesis Tree: Provide a structured set of hypotheses across (a) real product behavior, (b) instrumentation/metric definition, (c) pipeline/data quality, and (d) external factors (billing outages, partner incidents).
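One way a candidate might encode such a hypothesis tree so each hypothesis carries an owner and a disconfirming check. All hypothesis text, owner labels, and check descriptions below are illustrative placeholders, not a prescribed answer:

```python
# Illustrative data structure for the four-branch hypothesis tree.
# Hypotheses, owners, and checks are placeholders drawn from the scenario.

hypothesis_tree = {
    "real_product_behavior": [
        {"h": "New cancellation flow makes cancelling too easy",
         "owner": "PM", "check": "treatment vs. control churn delta"},
        {"h": "Play Billing v6 breaks renewals for BR/IN payment methods",
         "owner": "Android Eng", "check": "renewal success by market and app version"},
    ],
    "instrumentation_metric": [
        {"h": "Cohort or churn definition drifted in the dashboard",
         "owner": "Data Science", "check": "recompute churn from raw events"},
    ],
    "pipeline_data_quality": [
        {"h": "Schema evolution dropped or duplicated subscription events",
         "owner": "Data Eng", "check": "event volume before/after the schema deploy"},
    ],
    "external": [
        {"h": "Carrier billing or Play Store incident in BR/IN",
         "owner": "Finance + Support", "check": "partner status pages, refund reasons"},
    ],
}

for branch, hyps in hypothesis_tree.items():
    for item in hyps:
        print(f"[{branch}] {item['owner']}: {item['h']}")
```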
- Decision Framework: Define how you will decide between rollback vs. forward fix vs. pausing the experiment ramp, including what evidence thresholds you need.
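A decision framework of this kind can be made explicit as an ordered evidence-threshold rule. The thresholds, evidence field names, and action strings below are illustrative assumptions, not recommended values; the ordering reflects the scenario's rollback-risk asymmetry (cheap server-side rollback before high-risk app rollback):

```python
# Hedged sketch of an evidence-threshold decision rule. All thresholds and
# evidence keys are illustrative, not prescribed values.

def decide(evidence: dict) -> str:
    # Artifact confirmed: raw-event recount matches baseline, dashboard is wrong.
    if evidence.get("raw_recount_matches_baseline"):
        return "fix dashboard; monitor"
    # Treatment churns materially more than control: pause the experiment ramp.
    if evidence.get("treatment_minus_control_pp", 0) >= 1.0:
        return "pause experiment ramp"
    # Entitlement errors elevated: cache rollback is low risk, try it first.
    if evidence.get("entitlement_error_rate", 0) > 0.01:
        return "rollback entitlements cache change"
    # Renewal failures tied to v6 clients: high-risk rollback, needs approval.
    if evidence.get("v6_renewal_failure_delta_pp", 0) >= 0.5:
        return "hotfix or rollback Play Billing (Eng Director approval)"
    return "keep monitoring; reassess at next checkpoint"

print(decide({"entitlement_error_rate": 0.03}))
# -> rollback entitlements cache change
```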
- Stakeholder Comms Plan: Draft what you would send to the VP and cross-functional leads at T+1 hour and T+4 hours (focus on substance; the drafts need not be prose-perfect).
- Post-Incident Actions: Define what permanent fixes you would implement (monitoring, guardrails, runbooks, experiment kill-switches) to prevent recurrence.
Complications (Realistic Twists)
- Conflicting Metrics: The churn dashboard shows a spike, but the billing success rate dashboard looks flat. Meanwhile, support tickets for “lost access” are up 18%.
- Experiment Ramp Pressure: The PM wants to ramp the cancellation experiment to 100% by end of week to hit a quarterly goal. The Eng Director wants to avoid any rollback that could destabilize billing.
- Data Pipeline Change: You learn the data engineering team deployed a schema evolution to the subscription events table two days ago to support new fields for Play Billing v6.
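The schema-evolution twist suggests a quick data-quality check: if the migration silently dropped events, daily event volume should fall after the deploy even while billing success stays flat. A minimal sketch, assuming daily event counts keyed by date and a hypothetical 10% tolerance:

```python
# Assumed check: compare mean daily subscription-event volume before vs.
# after the schema deploy. Counts, dates, and tolerance are illustrative.

def volume_drop(daily_counts: dict, deploy_day: str, tolerance: float = 0.10) -> bool:
    """Flag if mean daily volume after the deploy fell more than tolerance."""
    before = [v for d, v in daily_counts.items() if d < deploy_day]
    after = [v for d, v in daily_counts.items() if d >= deploy_day]
    if not before or not after:
        return False
    before_mean = sum(before) / len(before)
    after_mean = sum(after) / len(after)
    return (before_mean - after_mean) / before_mean > tolerance

counts = {"06-01": 1_000_000, "06-02": 990_000, "06-03": 760_000, "06-04": 750_000}
print(volume_drop(counts, "06-03"))  # True: ~24% drop after the schema deploy
```

A drop like this would point the pod toward the "pipeline/data quality" branch of the hypothesis tree: lost renewal events can masquerade as churn in the dashboard.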
Success Criteria
Your answer will be evaluated on whether you:
- Validate whether the churn spike is real vs. artifact quickly and methodically.
- Use a crisp hypothesis-driven approach with clear owners and checkpoints.
- Make a defensible decision under uncertainty with an explicit rollback plan.
- Communicate clearly to execs and partners, balancing urgency and accuracy.
- Put in place durable guardrails (metric definitions, anomaly alerts, experiment safety checks, and incident runbooks).
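The experiment safety checks in the last criterion can be sketched as an automated guardrail ("kill-switch") that halts a ramp when a treatment-vs-control delta breaches a preset bound. Metric names and bounds below are illustrative assumptions:

```python
# Sketch of an experiment kill-switch guardrail. The metrics and bounds are
# illustrative; real guardrails would be agreed pre-launch with Data Science.

GUARDRAILS = {
    "churn_7d_pp_delta": 0.5,               # treatment - control, pct points
    "lost_access_tickets_pct_delta": 10.0,  # support-ticket growth, percent
    "refund_rate_pp_delta": 0.3,            # refund-rate growth, pct points
}

def should_kill(observed: dict) -> list:
    """Return the guardrail metrics breached by observed treatment deltas."""
    return [m for m, bound in GUARDRAILS.items() if observed.get(m, 0) > bound]

breaches = should_kill({"churn_7d_pp_delta": 2.4, "refund_rate_pp_delta": 0.1})
print(breaches)  # ['churn_7d_pp_delta'] -> the ramp halts automatically
```

With a guardrail like this wired into the experiment platform, the cancellation-flow ramp in this scenario would have paused itself at the +2.4pp churn delta instead of waiting for a human to read a dashboard.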