Business Context
You’re on the Risk ML team at PayWave, a fintech payment processor handling ~35M card-not-present transactions/day across North America and the EU. Fraud losses are trending upward after a product expansion into new merchant categories. The business wants a model that flags fraudulent transactions in near real time so that downstream rules and step-up authentication can reduce chargebacks without causing excessive customer friction.
A previous team trained a gradient-boosted tree model and reported “good AUC,” but in production the model’s performance is unstable across weeks and new merchants. Your task is to use learning curves (training vs validation performance as a function of training set size) to diagnose whether the system is data-limited, high-bias, or high-variance, and to propose concrete next steps.
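The diagnosis described above can be sketched in a few lines: train on growing slices of an earlier period, always evaluate on the same later window, and compare the two curves. The data below is synthetic and the feature/label setup is an illustrative assumption; only the split-and-score logic carries over to the real system.

```python
# Minimal learning-curve sketch: fixed later validation window,
# growing training slices from the earlier period. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 2.5).astype(int)  # rare positives

# Time-ordered: first 80% is the training pool, last 20% is validation.
split = int(0.8 * n)
X_pool, y_pool = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

train_scores, val_scores = [], []
for frac in [0.1, 0.25, 0.5, 1.0]:
    m = int(frac * split)
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_pool[:m], y_pool[:m])
    train_scores.append(
        average_precision_score(y_pool[:m], clf.predict_proba(X_pool[:m])[:, 1]))
    val_scores.append(
        average_precision_score(y_val, clf.predict_proba(X_val)[:, 1]))

# Reading the curves: a large, persistent train/val gap suggests high
# variance; both curves low and flat suggests high bias; a validation
# curve still climbing at full size suggests the system is data-limited.
```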
Dataset
You have a historical labeled dataset built from chargeback outcomes and manual review.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Transaction attributes | 18 | amount, currency, merchant_category, channel, local_hour | Some categoricals are high-cardinality |
| Customer behavior aggregates | 22 | txns_1h, txns_24h, avg_amount_30d, device_count_7d | Aggregates computed at event time |
| Merchant risk signals | 9 | merchant_age_days, dispute_rate_90d, mcc_risk_score | Sparse for new merchants |
| Device / network | 14 | device_fingerprint_hash, ip_asn, proxy_flag | Missingness varies by region |
| Geo / compliance | 6 | country, region, sanctions_match_flag | Must be explainable |
- Size: ~120M labeled transactions over 6 months (a sample of total volume, given ~35M transactions/day), 69 features
- Target: Binary — fraud (1) if chargeback or confirmed review within 60 days
- Class balance: 0.35% fraud (highly imbalanced)
- Missing data: ~10% missing in device signals (privacy settings), ~25% missing in merchant aggregates for new merchants
Success Criteria
- Operational goal: At a decision threshold that triggers step-up auth, achieve ≥ 70% recall with ≤ 2.0% false positive rate (FPR) on a held-out time window.
- Stability: Metric drift week-over-week should be explainable; you must propose monitoring tied to the learning-curve diagnosis.
- Interpretability: Provide a path to explain top drivers (e.g., SHAP) for compliance and merchant support.
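The operational goal above reduces to a concrete threshold search: among all thresholds whose FPR is at or below 2.0%, pick the one with the highest recall and check whether it clears 70%. A hedged sketch, with synthetic scores standing in for real model output:

```python
# Find the best recall achievable at FPR <= 2% on validation scores.
# Labels and score distributions here are synthetic assumptions.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_val = rng.binomial(1, 0.02, size=50000)            # rare fraud labels
scores = np.where(y_val == 1,
                  rng.beta(5, 2, size=y_val.size),   # fraud scores skew high
                  rng.beta(2, 5, size=y_val.size))   # legit scores skew low

fpr, tpr, thresholds = roc_curve(y_val, scores)
ok = fpr <= 0.02                       # FPR-feasible operating points
best = np.argmax(tpr[ok])              # best recall among feasible points
threshold = thresholds[ok][best]
recall_at_threshold = tpr[ok][best]
print(f"threshold={threshold:.3f} "
      f"recall={recall_at_threshold:.3f} fpr={fpr[ok][best]:.4f}")
```

Whether `recall_at_threshold` reaches 0.70 on the real data is exactly what the success criterion tests; on synthetic scores it is just a demonstration of the selection logic.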
Constraints
- Latency: p95 inference budget < 25 ms per transaction (online scoring).
- Training budget: daily training job is limited to 8 CPU-hours (or equivalent); you may subsample the training data.
- Data leakage risk: Aggregates must be computed using only information available before the transaction timestamp.
- Temporal generalization: Evaluation must be time-based (no random split).
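The leakage constraint is easy to violate with naive rolling aggregates, because a time-window aggregate typically includes the current event. A minimal pandas sketch (column names `customer_id`, `ts`, `amount` are illustrative assumptions) showing one way to compute `txns_24h` using only prior events:

```python
# Event-time-safe aggregate: count a customer's transactions in the
# prior 24h, excluding the current transaction itself.
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2"],
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 12:00",
                          "2024-01-02 11:00", "2024-01-01 10:30"]),
    "amount": [20.0, 35.0, 50.0, 10.0],
}).sort_values(["customer_id", "ts"]).reset_index(drop=True)

# rolling("24h") over a time index includes the current row, so subtract
# 1 to keep only events strictly before the transaction being scored.
counts = (df.set_index("ts")
            .groupby("customer_id")["amount"]
            .rolling("24h").count() - 1)
df["txns_24h"] = counts.to_numpy()  # row order matches the (id, ts) sort
```

Sorting by `(customer_id, ts)` before the groupby-rolling is what keeps the result aligned with the original rows when assigning back.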
Deliverables
- Define “learning curve” in the context of this fraud system (what is on each axis, what curves you plot, and why).
- Build learning curves for at least two models (e.g., regularized logistic regression and gradient-boosted trees) using a time-based split.
- Interpret the curves to diagnose bias vs variance vs data limitation and propose 3 concrete interventions (e.g., feature work, regularization, more data, model change).
- Specify which metric(s) you would plot for learning curves given the class imbalance (and why accuracy is misleading).
- Provide an experiment plan: what you would try in the next week to hit the success criteria while staying within latency/training constraints.