Project Background
You are the program manager for Square’s Cash App Pay risk platform. Cash App Pay is expanding deeper into online checkout for mid-market merchants (fashion, electronics, ticketing) and is currently processing ~$1.2B/month in card-not-present volume in the US. Chargebacks have increased over the last two quarters, and the CFO has set a goal to reduce fraud losses before peak season.
The Risk ML team has built a new real-time fraud decisioning model (“Falcon v2”) that is intended to replace the current rules-heavy system (“Falcon v1”). In offline evaluation, Falcon v2 reduces fraud chargebacks by 18% at the same overall approval rate. However, during fairness analysis, the Data Science lead found a statistically meaningful disparity: for a segment inferred as having a Spanish-language preference (based on app language + merchant locale signals), Falcon v2 raises the decline rate by 2.4 percentage points relative to Falcon v1. Legal flags that this could be interpreted as discriminatory impact under certain state consumer protection frameworks, even if the underlying signals are not explicit protected-class data.
You have 10 weeks until a contractual commitment with a top merchant aggregator goes live. The aggregator represents ~9% of Cash App Pay monthly volume and has a clause that allows them to renegotiate fees if chargeback rates exceed a threshold during the first 60 days. Engineering believes Falcon v2 is the fastest path to meeting the chargeback target, but Legal and Comms are concerned about reputational and regulatory risk if the disparity becomes public.
Stakeholder Landscape
- Risk ML (Data Science + ML Engineering) wants to launch Falcon v2 quickly to stop rising losses and prove the new platform. They argue the disparity may be due to label bias and that the model is still more accurate overall.
- Payments Engineering owns the real-time decisioning service (150ms p95 SLA). They are already committed to a separate tokenization migration and can only allocate limited on-call coverage for a risky rollout.
- Legal/Compliance is conservative: they want a documented disparate-impact assessment, mitigation plan, and executive sign-off before any launch. They prefer delaying the launch rather than accepting uncertain exposure.
- Merchant Partnerships / Sales needs a credible story for the aggregator and wants the lowest possible false declines to avoid merchant churn and support tickets.
- Customer Support + Trust & Safety worries about increased “payment declined” contacts and social media escalation if a particular community is disproportionately affected.
You are accountable for aligning these groups, making a recommendation, and executing the plan.
Constraints
| Constraint | Details |
|---|---|
| Timeline | 10 weeks until aggregator go-live; 2-week code freeze before peak season |
| Team | 6 backend engineers (2 shared with tokenization), 4 DS/ML, 1 designer (part-time), 1 analytics, 1 legal counsel (part-time) |
| Runtime | Decisioning must stay under 150ms p95; no new synchronous vendor calls |
| Data | Cannot use explicit protected-class attributes; only existing first-party signals allowed |
| Risk | Any production incident affecting approvals >1% for >30 minutes triggers executive escalation |
| Budget | $120K available for external audit/consulting and tooling; no new headcount |
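
To make the Runtime and Risk rows concrete, here is a minimal Python sketch of how those two limits might be checked against monitoring data. It is an illustration under assumptions, not a description of the actual decisioning service: the function names, the one-sample-per-minute input, and the reading of ">1%" as one percentage point of approval rate against a baseline are all hypothetical.

```python
from typing import Sequence

# Limits taken from the Constraints table; how they are measured is assumed.
APPROVAL_DEVIATION_LIMIT = 0.01  # ">1%", read here as 1 percentage point (assumption)
SUSTAINED_MINUTES = 30           # ">30 minutes"
P95_LATENCY_LIMIT_MS = 150.0     # "under 150ms p95"


def needs_escalation(per_minute_approval: Sequence[float], baseline: float) -> bool:
    """True if the approval rate stays more than the limit away from the
    baseline for more than SUSTAINED_MINUTES consecutive one-minute samples."""
    run = 0
    for rate in per_minute_approval:
        if abs(rate - baseline) > APPROVAL_DEVIATION_LIMIT:
            run += 1
            if run > SUSTAINED_MINUTES:
                return True
        else:
            run = 0  # the deviation must be continuous, so reset the streak
    return False


def p95(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not latencies_ms:
        raise ValueError("latencies_ms must not be empty")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def within_latency_sla(latencies_ms: Sequence[float]) -> bool:
    """True if the sample's p95 is under the 150 ms limit."""
    return p95(latencies_ms) < P95_LATENCY_LIMIT_MS
```

A check like `needs_escalation` could also serve as an automatic rollback trigger if it fires at a tighter, pre-escalation threshold; the right threshold is a decision for the candidate's plan, not something the table specifies.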
What You Need to Deliver (Candidate Tasks)
- Launch recommendation and trade-off framing: Decide whether to (a) launch Falcon v2 as-is, (b) delay launch until mitigations are implemented, or (c) do a staged/limited rollout with guardrails. Explain your reasoning and what you would communicate to execs.
- Execution plan: Provide a week-by-week plan that includes engineering, DS, legal review, and operational readiness (monitoring, support playbooks).
- Ethical dilemma handling: Describe how you would structure the ethical review: what questions you ask, what evidence you require, who must sign off, and how you document the decision.
- Risk mitigation and rollback: Define concrete launch guardrails, monitoring, and rollback triggers (e.g., disparity thresholds, approval-rate drops, complaint spikes).
- Success criteria: Define measurable success metrics for fraud reduction, customer experience, and fairness, including what “good” looks like in the first 7/30/60 days.
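
One way a disparity-based rollback trigger from the tasks above could be sketched in Python is shown here. It compares the decline rate for the affected segment under Falcon v2 against a Falcon v1 holdout for the same segment, and fires only when the gap is both large and statistically distinguishable from noise. The 1.0 percentage point threshold, the 1.96 critical value, the holdout design, and every name in the code are illustrative assumptions; none of them come from the case description.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Declined and total transaction counts for one arm of a comparison."""
    declined: int
    total: int

    @property
    def rate(self) -> float:
        return self.declined / self.total


def decline_gap_pp(v2_segment: Counts, v1_segment: Counts) -> float:
    """Decline rate under v2 minus decline rate under v1, in percentage points."""
    return (v2_segment.rate - v1_segment.rate) * 100


def two_proportion_z(a: Counts, b: Counts) -> float:
    """Pooled two-proportion z statistic for rate(a) - rate(b)."""
    pooled = (a.declined + b.declined) / (a.total + b.total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.total + 1 / b.total))
    return (a.rate - b.rate) / se


def should_roll_back(
    v2_segment: Counts,
    v1_segment: Counts,
    max_gap_pp: float = 1.0,  # hypothetical guardrail, to be agreed with Legal
    z_crit: float = 1.96,     # roughly a 5% two-sided significance level
) -> bool:
    """True if the v2 decline gap exceeds the guardrail and is significant."""
    gap = decline_gap_pp(v2_segment, v1_segment)
    return gap > max_gap_pp and two_proportion_z(v2_segment, v1_segment) > z_crit
```

Requiring both conditions keeps a small, noisy sample in the first days of a staged rollout from triggering a rollback on its own, at the cost of reacting more slowly when segment volume is low.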
Complications (Assume These Are True)
- Feature leak risk: A journalist has previously covered “algorithmic bias” in fintech declines. Comms says that if Falcon v2 launches and the disparity is discovered, you may have only 48 hours to respond publicly.
- Engineering capacity shock: Two weeks into the project, the tokenization migration has a Sev-2 incident and pulls one of your key backend engineers for at least 10 business days.
- Aggregator pressure: The aggregator’s CTO requests an earlier pilot in Week 6, threatening to delay their integration if they can’t validate fraud performance before their own code freeze.
Your answer should demonstrate how you drive alignment, make principled trade-offs, and execute under time pressure, while treating the fairness concern as a first-class launch requirement, not a footnote.