OpenAI PayOps is building a binary classifier to flag potentially fraudulent API credit-card top-ups before settlement. The model will score transactions in near real time, so the team needs a reliable offline evaluation method for model selection and hyperparameter tuning without overfitting to a single validation split.
You are given a tabular dataset of historical top-up events.
| Feature Group | Count | Examples |
|---|---|---|
| Transaction features | 12 | amount_usd, card_bin_risk_score, retry_count, hour_of_day |
| Account features | 9 | account_age_days, org_size, prior_chargebacks, plan_tier |
| Behavioral aggregates | 11 | topups_last_24h, failed_payments_7d, avg_ticket_size_30d |
| Device / network | 6 | ip_country, vpn_flag, device_fingerprint_age, ASN |
| Labels / keys | 3 | transaction_id, event_date, is_fraud |
is_fraud = 1 if the transaction was later confirmed fraudulent.

A good solution should explain k-fold cross-validation clearly, implement it correctly, and use it to compare at least two candidate models. The selected model should achieve PR-AUC >= 0.42 on a held-out test set and show stable fold-to-fold performance (low variance across folds).