Business Context
You’re interviewing for an ML engineering role on the Risk team at Stripe-likePay, a global payments platform processing ~25M card-not-present transactions/day across e-commerce and subscription merchants. Fraud losses and chargeback fees are material: a 5 bps increase in fraud rate can translate to $8–12M/month in direct losses and network penalties. The business wants a model that can block high-risk transactions in real time while minimizing false declines that hurt merchant conversion.
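A quick back-of-envelope check of those figures (assuming a 30-day month and that losses scale linearly with fraud count, neither of which is stated in the brief):

```python
# Sanity-check the brief's numbers: 25M txns/day, a 5 bps fraud-rate
# increase, and $8-12M/month of incremental loss.
# Assumptions (not in the source): 30-day month, linear scaling.
txns_per_day = 25_000_000
bps_increase = 5 / 10_000                             # 5 basis points as a fraction
extra_fraud_per_day = txns_per_day * bps_increase     # extra fraudulent txns/day
extra_fraud_per_month = extra_fraud_per_day * 30

# Implied average cost (direct loss + penalties) per fraudulent transaction
low = 8e6 / extra_fraud_per_month
high = 12e6 / extra_fraud_per_month
print(extra_fraud_per_day, round(low), round(high))
```

The implied per-transaction cost band is a useful cross-check when you later trade off false declines against missed fraud.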
A staff engineer asks you to go beyond the textbook definition of a decision boundary and reason about how decision boundaries behave under real production constraints: class imbalance, drifting fraud patterns, calibration, and thresholding.
Dataset
You are given an offline training dataset built from 60 days of historical transactions.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Transaction | 18 | amount_usd, currency, merchant_category, is_recurring, hour_of_day | Heavy-tailed amounts; strong seasonality |
| Card & account | 14 | card_age_days, account_age_days, prior_chargebacks_30d, velocity_5m | Velocity features are sparse for new users |
| Device & network | 22 | device_fingerprint_hash, ip_asn, geo_distance_km, proxy_score | Categorical high-cardinality + noisy geo |
| Merchant | 10 | merchant_risk_tier, dispute_rate_90d, avg_ticket_size | Some features only available after onboarding |
- Size: ~120M labeled transactions (train), 10M (validation), 10M (test)
- Target: is_fraud (1 if confirmed fraud/chargeback within 60 days, else 0)
- Class balance: 0.35% positive (fraud)
- Missingness: ~8% missing in device signals (ad blockers), 12% missing in merchant features for newly onboarded merchants
Success Criteria
Your model will be used to auto-decline transactions above a risk threshold and to route transactions in a middle score band to step-up authentication (3DS / OTP).
- Primary: Achieve recall ≥ 70% at precision ≥ 20% on the auto-decline segment (very imbalanced setting).
- Secondary: Maintain false decline rate ≤ 0.20% overall.
- Operational: Provide a clear explanation of how the decision boundary changes when you adjust thresholds and regularization.
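One way to check whether the primary target is attainable is to scan the precision-recall curve on held-out data for thresholds that satisfy both constraints at once. A minimal sketch with synthetic scores (in practice `y_true`/`y_score` come from the validation set; the score distributions here are invented):

```python
# Sketch: search for an auto-decline threshold meeting recall >= 0.70
# at precision >= 0.20. Synthetic data only; real scores come from the
# trained model on the 10M-row validation set.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
n = 200_000
y_true = (rng.random(n) < 0.0035).astype(int)      # ~0.35% positives, as in the brief
y_score = rng.normal(0.0, 1.0, n) + 3.0 * y_true   # toy separable scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision/recall have len(thresholds)+1 entries; drop the final sentinel
ok = (precision[:-1] >= 0.20) & (recall[:-1] >= 0.70)
if ok.any():
    # Among qualifying thresholds, take the one with the highest recall
    best = np.argmax(np.where(ok, recall[:-1], -1.0))
    print("auto-decline threshold:", thresholds[best])
else:
    print("targets not jointly achievable on this data")
```

If no threshold qualifies, the constraints are infeasible for that model and the fix is a better model or renegotiated targets, not threshold tuning.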
Constraints
- Latency: p95 < 50 ms end-to-end scoring (feature retrieval + model).
- Interpretability: Risk ops needs reason codes; model must support post-hoc explanations (e.g., SHAP on a sample).
- Stability: Fraud patterns drift weekly; you must propose a monitoring plan tied to boundary movement.
- Deployment: Single model artifact, CPU-only, must handle high-cardinality categoricals safely.
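One standard way to satisfy the last constraint is the hashing trick: unseen device fingerprints or ASNs at serving time map into a fixed-size index space instead of growing the artifact or raising key errors. A sketch (the bucket count is an assumption to tune; feature names are from the brief):

```python
# Sketch: stable feature hashing for high-cardinality categoricals
# (device_fingerprint_hash, ip_asn). Uses hashlib rather than Python's
# builtin hash(), which varies across processes under hash randomization.
import hashlib

N_BUCKETS = 2 ** 20  # assumption: ~1M buckets; tune vs collision rate

def hash_bucket(feature_name: str, value: str, n_buckets: int = N_BUCKETS) -> int:
    """Deterministic bucket index, namespaced by feature name."""
    key = f"{feature_name}={value}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % n_buckets

b1 = hash_bucket("device_fingerprint_hash", "ab12cd")
b2 = hash_bucket("device_fingerprint_hash", "ab12cd")
print(b1, b1 == b2)  # same input -> same bucket, in any process
```

Namespacing by feature name keeps `ip_asn="123"` and `merchant_category="123"` from colliding by construction.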
Deliverables
- Explain what a decision boundary is for (a) logistic regression, (b) linear SVM, (c) gradient-boosted trees, and how it relates to thresholding and calibration.
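For part (a), one concrete fact worth demonstrating: a logistic regression's boundary at probability threshold t is the hyperplane w·x + b = log(t/(1−t)), so moving the threshold translates the boundary without rotating it. A small sketch on synthetic 2D data:

```python
# Sketch: thresholding logistic regression at t is equivalent to
# shifting the hyperplane offset to logit(t) = log(t/(1-t)).
# Synthetic data; illustrates the threshold/boundary relationship only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

for t in (0.5, 0.9):
    offset = np.log(t / (1 - t))          # logit of the threshold
    linear_rule = (X @ w + b >= offset).astype(int)
    proba_rule = (clf.predict_proba(X)[:, 1] >= t).astype(int)
    assert np.array_equal(linear_rule, proba_rule)
print("thresholding moves the hyperplane's offset, not its orientation")
```

The same logic motivates the calibration discussion: the threshold only has a business meaning ("p(fraud) ≥ t") if the probabilities are calibrated.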
- Propose an end-to-end modeling approach that explicitly addresses:
- extreme class imbalance
- missing data patterns
- high-cardinality categoricals
- drift and boundary stability
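For the imbalance bullet, class weighting is one standard lever (undersampling is another, but it distorts predicted probabilities unless you recalibrate afterward). A minimal sketch on synthetic data at roughly the brief's 0.35% positive rate:

```python
# Sketch: class weighting for extreme imbalance. class_weight="balanced"
# reweights positives by ~n_neg/n_pos (~285x at a 0.35% base rate).
# Synthetic data with an artificially clean signal, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 100_000
y = (rng.random(n) < 0.0035).astype(int)
X = rng.normal(size=(n, 4)) + 1.5 * y[:, None]   # positives shifted on all features

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

recall = (clf.predict(X[y == 1]) == 1).mean()
print(f"train recall with balanced weights: {recall:.2f}")
```

Note that weighting shifts the effective decision boundary toward the majority class's region; if you need calibrated probabilities afterward, recalibrate (e.g., isotonic or Platt scaling) on an unweighted holdout.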
- Provide a Python implementation that trains a baseline and a stronger model, plots/inspects decision boundary behavior on a 2D slice, and evaluates with fraud-appropriate metrics.
- Describe how you would choose operating thresholds for auto-decline vs step-up, and what you would monitor in production to detect boundary shift.
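For the monitoring piece, a common boundary-shift signal is the Population Stability Index (PSI) on the model's score distribution: fix bin edges on a reference window, then compare each new window against it. A sketch (the 0.2 alert level is a conventional rule of thumb, not from the brief; score distributions here are synthetic):

```python
# Sketch: PSI between a reference score distribution and a live window.
# PSI ~ 0 means stable; > ~0.2 is a conventional "investigate" level.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch out-of-range scores
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6                                 # avoid log(0) in empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(3)
ref = rng.beta(1, 40, 100_000)       # reference week's score distribution
same = rng.beta(1, 40, 100_000)      # no drift
shifted = rng.beta(1, 25, 100_000)   # scores drifting upward

print(round(psi(ref, same), 3), round(psi(ref, shifted), 3))
```

In production you would track this per feature as well as on the final score, alongside realized precision/recall at the deployed thresholds once labels mature.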