Business Context
You’re interviewing for a Senior ML Engineer role on the Risk team at SwiftPay, a global card processor handling ~45M transactions/day across North America and Europe. Fraud is rare but costly: false negatives create direct chargeback losses and regulatory scrutiny, while false positives cause customer friction and support costs. The team needs a model that scores transactions in real time to decide whether to approve, decline, or step-up authenticate.
The class distribution is highly imbalanced and non-stationary: overall fraud prevalence is ~0.25%, but it spikes to 1–2% in certain merchant categories and during coordinated attacks. You are asked to explain the challenges of working with imbalanced datasets and to propose a production-ready modeling and evaluation plan.
Dataset
You are given 6 months of labeled transactions (labels arrive with delay due to chargeback windows).
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Transaction attributes | 18 | amount, currency, card_present, entry_mode, mcc, merchant_country | amount is heavy-tailed; some categorical drift |
| Customer behavior | 22 | tx_count_1h/24h/7d, avg_amount_30d, new_merchant_flag | aggregation windows computed at event time |
| Merchant risk | 10 | merchant_fraud_rate_7d, merchant_velocity | leakage risk if computed incorrectly (see sketch below) |
| Device/network | 14 | device_id_hash, ip_asn, geo_distance_km | missingness varies by channel |
| Auth signals | 7 | 3ds_attempted, avs_result, cvv_result | not always present |
- Size: ~120M transactions, 71 features (numerical + high-cardinality categorical)
- Target: is_fraud (1 if confirmed fraud/chargeback, else 0)
- Class balance: 0.25% positive, 99.75% negative (≈ 1:400)
- Missing data: 3DS and AVS/CVV fields missing for ~35% (channel-dependent); device_id missing for ~8%
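The merchant-risk aggregates are the main leakage hazard: merchant_fraud_rate_7d must only see transactions that precede the one being scored. A minimal pandas sketch of a point-in-time version, with hypothetical column names (merchant_id, event_time, is_fraud):

```python
import pandas as pd

def merchant_fraud_rate_7d(df: pd.DataFrame) -> pd.Series:
    """Point-in-time 7-day merchant fraud rate.

    df needs merchant_id, event_time (datetime64) and is_fraud (0/1),
    one row per transaction.
    """
    df = df.sort_values("event_time")
    out = pd.Series(index=df.index, dtype="float64")
    for _, grp in df.groupby("merchant_id", sort=False):
        s = grp.set_index("event_time")["is_fraud"].astype("float64")
        # closed="left" keeps the current transaction out of its own window,
        # so the feature never sees the label it is trying to predict
        win = s.rolling("7D", closed="left")
        out.loc[grp.index] = (win.sum() / win.count()).to_numpy()
    return out  # NaN where a merchant has no prior history in the window
```

In production the same window would additionally be restricted to frauds confirmed before event_time, since labels arrive 7–60 days late.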
Success Criteria (business-facing)
Your model is used to trigger a step-up flow (not an auto-decline) and must meet:
- Recall ≥ 70% on fraud at the chosen operating point.
- False positive rate (FPR) ≤ 0.30% overall (to protect customer experience).
- Lift ≥ 8× in the top 0.5% risk bucket, i.e. the fraud rate among the highest-scoring 0.5% of transactions must be at least 8× the overall fraud rate, so analysts can investigate effectively (see the metric sketch after this list).
- Stable performance across key slices: card_present vs card_not_present, top 10 MCCs, and top 5 countries.
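These criteria translate directly into three offline numbers computed on held-out scores. A minimal sketch (function and argument names are my own, not part of the brief) of recall and FPR at an operating threshold plus lift in the top 0.5% bucket:

```python
import numpy as np

def operating_point_metrics(y_true, scores, threshold, top_frac=0.005):
    """Recall and FPR at a score threshold, plus lift in the top `top_frac`."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    flagged = scores >= threshold
    recall = flagged[y_true == 1].mean()       # fraud caught: target >= 0.70
    fpr = flagged[y_true == 0].mean()          # good tx flagged: target <= 0.003
    k = max(1, int(top_frac * len(scores)))    # size of the top 0.5% bucket
    top = np.argsort(-scores)[:k]
    lift = y_true[top].mean() / y_true.mean()  # fraud rate in bucket vs. overall
    return recall, fpr, lift
```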
Constraints
- Latency: p99 inference < 50 ms (Python service + feature store lookup).
- Training: daily retrain allowed; full training budget < 2 hours on a single 32-core machine.
- Label delay: fraud labels can arrive 7–60 days later; you must avoid leakage (see the split sketch after this list).
- Interpretability: risk ops requires reason codes (top contributing features) for escalations.
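The label-delay constraint shapes the split before it shapes the model: the most recent transactions cannot yet be trusted as negatives. A minimal sketch of a time-ordered split that drops a label-immature tail (the event_time column and the 60-day maturity window are assumptions based on the brief):

```python
import pandas as pd

def leakage_aware_split(df: pd.DataFrame, valid_days: int = 30,
                        maturity_days: int = 60):
    """Time-ordered train/validation split that drops label-immature data."""
    # Drop the most recent window: its "non-fraud" labels may simply be
    # chargebacks that have not arrived yet (7-60 day delay).
    mature_end = df["event_time"].max() - pd.Timedelta(days=maturity_days)
    mature = df[df["event_time"] <= mature_end]
    # Validate on the latest mature slice so evaluation mimics deployment.
    valid_start = mature["event_time"].max() - pd.Timedelta(days=valid_days)
    train = mature[mature["event_time"] <= valid_start]
    valid = mature[mature["event_time"] > valid_start]
    return train, valid
```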
Deliverables (what you must produce in the interview)
- Explain the core challenges of imbalanced datasets in this setting (learning dynamics, metrics, thresholding, slice risk, label noise, drift).
- Propose an end-to-end modeling approach (baseline + improved model), including imbalance handling.
- Define an evaluation plan: split strategy, metrics, and how you will pick an operating threshold.
- Describe how you would validate the model in production (monitoring, drift, calibration, alerting).
- Provide a short implementation sketch (training + evaluation) and justify key choices; a minimal sketch follows this list.
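One way to cover the last deliverable within the CPU, latency, and interpretability constraints is a gradient-boosted tree model with class reweighting. The sketch below is illustrative, not prescriptive: it assumes LightGBM, the split and metric helpers above, and hypothetical column names.

```python
import lightgbm as lgb
import numpy as np

def train_and_pick_threshold(train, valid, target="is_fraud", fpr_budget=0.003):
    """Train a reweighted LightGBM classifier and pick a threshold that
    spends at most `fpr_budget` false-positive rate on the validation set."""
    # Assumes high-cardinality categoricals are already encoded
    # (e.g. pandas 'category' dtype, which LightGBM handles natively).
    features = [c for c in train.columns if c not in (target, "event_time")]
    pos = train[target].sum()
    neg = len(train) - pos
    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=63,
        scale_pos_weight=neg / pos,   # counteract the ~1:400 class ratio
        n_jobs=32,                    # fits the single 32-core training box
    )
    model.fit(
        train[features], train[target],
        eval_set=[(valid[features], valid[target])],
        eval_metric="average_precision",       # PR-style metric suits rare positives
        callbacks=[lgb.early_stopping(50)],
    )
    scores = model.predict_proba(valid[features])[:, 1]
    # Highest threshold whose FPR stays within the 0.30% customer-friction budget.
    threshold = np.quantile(scores[valid[target] == 0], 1 - fpr_budget)
    return model, scores, threshold
```

For the reason-code requirement, per-transaction feature attributions (e.g. SHAP values on the trained tree model) can be surfaced as the top contributing features for each escalation.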