OpenAI PayOps is building a binary classifier to flag potentially fraudulent API credit-card top-ups before settlement. The model will score transactions in near real time, so the team needs a reliable offline evaluation method for model selection and hyperparameter tuning without overfitting to a single validation split.
You are given a tabular dataset of historical top-up events.
| Feature Group | Count | Examples |
|---|---|---|
| Transaction features | 12 | amount_usd, card_bin_risk_score, retry_count, hour_of_day |
| Account features | 9 | account_age_days, org_size, prior_chargebacks, plan_tier |
| Behavioral aggregates | 11 | topups_last_24h, failed_payments_7d, avg_ticket_size_30d |
| Device / network | 6 | ip_country, vpn_flag, device_fingerprint_age, ASN |
| Labels / keys | 3 | transaction_id, event_date, is_fraud |
is_fraud = 1 if the transaction was later confirmed fraudulent.

A good solution should explain k-fold cross-validation clearly, implement it correctly, and use it to compare at least two candidate models. The selected model should achieve PR-AUC >= 0.42 on a held-out test set and show stable fold-to-fold performance (low variance across folds).