Business Context
You’re on the Risk ML team at PayWave, a fintech payment processor handling ~35M card-not-present transactions/day across North America and the EU. Fraud losses are trending upward after a product expansion into new merchant categories. The business wants a model that flags fraudulent transactions in near real time so that downstream rules and step-up authentication can reduce chargebacks without causing excessive customer friction.
A previous team trained a gradient-boosted tree model and reported “good AUC,” but in production the model’s performance is unstable across weeks and new merchants. Your task is to use learning curves (training vs validation performance as a function of training set size) to diagnose whether the system is data-limited, high-bias, or high-variance, and to propose concrete next steps.
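The diagnosis described above can be sketched in a few lines: train on growing slices of an earlier period, always evaluate on the same later window, and compare the two curves. The data below is synthetic and the feature/label setup is an illustrative assumption; only the split-and-score logic carries over to the real system.

```python
# Minimal learning-curve sketch: fixed later validation window,
# growing training slices from the earlier period. Synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 2.5).astype(int)  # rare positives

# Time-ordered: first 80% is the training pool, last 20% is validation.
split = int(0.8 * n)
X_pool, y_pool = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

train_scores, val_scores = [], []
for frac in [0.1, 0.25, 0.5, 1.0]:
    m = int(frac * split)
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_pool[:m], y_pool[:m])
    train_scores.append(
        average_precision_score(y_pool[:m], clf.predict_proba(X_pool[:m])[:, 1]))
    val_scores.append(
        average_precision_score(y_val, clf.predict_proba(X_val)[:, 1]))

# Reading the curves: a large, persistent train/val gap suggests high
# variance; both curves low and flat suggests high bias; a validation
# curve still climbing at full size suggests the system is data-limited.
```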
Dataset
You have a historical labeled dataset built from chargeback outcomes and manual review.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Transaction attributes | 18 | amount, currency, merchant_category, channel, local_hour | Some categoricals are high-cardinality |
| Customer behavior aggregates | 22 | txns_1h, txns_24h, avg_amount_30d, device_count_7d | Aggregates computed at event time |
| Merchant risk signals | 9 | merchant_age_days, dispute_rate_90d, mcc_risk_score | Sparse for new merchants |
| Device / network | 14 | device_fingerprint_hash, ip_asn, proxy_flag | Missingness varies by region |
| Geo / compliance | 6 | country, region, sanctions_match_flag | Must be explainable |
- Size: ~120M labeled transactions over 6 months (a sample of total volume, given ~35M transactions/day), 69 features
- Target: Binary — fraud (1) if chargeback or confirmed review within 60 days
- Class balance: 0.35% fraud (highly imbalanced)
- Missing data: ~10% missing in device signals (privacy settings), ~25% missing in merchant aggregates for new merchants
Success Criteria
- Operational goal: At a decision threshold that triggers step-up auth, achieve ≥ 70% recall with ≤ 2.0% false positive rate (FPR) on a held-out time window.
- Stability: Metric drift week-over-week should be explainable; you must propose monitoring tied to the learning-curve diagnosis.
- Interpretability: Provide a path to explain top drivers (e.g., SHAP) for compliance and merchant support.
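The operational goal above reduces to a concrete threshold search: among all thresholds whose FPR is at or below 2.0%, pick the one with the highest recall and check whether it clears 70%. A hedged sketch, with synthetic scores standing in for real model output:

```python
# Find the best recall achievable at FPR <= 2% on validation scores.
# Labels and score distributions here are synthetic assumptions.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
y_val = rng.binomial(1, 0.02, size=50000)            # rare fraud labels
scores = np.where(y_val == 1,
                  rng.beta(5, 2, size=y_val.size),   # fraud scores skew high
                  rng.beta(2, 5, size=y_val.size))   # legit scores skew low

fpr, tpr, thresholds = roc_curve(y_val, scores)
ok = fpr <= 0.02                       # FPR-feasible operating points
best = np.argmax(tpr[ok])              # best recall among feasible points
threshold = thresholds[ok][best]
recall_at_threshold = tpr[ok][best]
print(f"threshold={threshold:.3f} "
      f"recall={recall_at_threshold:.3f} fpr={fpr[ok][best]:.4f}")
```

Whether `recall_at_threshold` reaches 0.70 on the real data is exactly what the success criterion tests; on synthetic scores it is just a demonstration of the selection logic.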
Constraints
- Latency: p95 inference budget < 25 ms per transaction (online scoring).
- Training budget: daily training job is limited to 8 CPU-hours (or equivalent); you may subsample the training data.
- Data leakage risk: Aggregates must be computed using only information available before the transaction timestamp.
- Temporal generalization: Evaluation must be time-based (no random split).
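The leakage constraint is easy to violate with naive rolling aggregates, because a time-window aggregate typically includes the current event. A minimal pandas sketch (column names `customer_id`, `ts`, `amount` are illustrative assumptions) showing one way to compute `txns_24h` using only prior events:

```python
# Event-time-safe aggregate: count a customer's transactions in the
# prior 24h, excluding the current transaction itself.
import pandas as pd

df = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2"],
    "ts": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 12:00",
                          "2024-01-02 11:00", "2024-01-01 10:30"]),
    "amount": [20.0, 35.0, 50.0, 10.0],
}).sort_values(["customer_id", "ts"]).reset_index(drop=True)

# rolling("24h") over a time index includes the current row, so subtract
# 1 to keep only events strictly before the transaction being scored.
counts = (df.set_index("ts")
            .groupby("customer_id")["amount"]
            .rolling("24h").count() - 1)
df["txns_24h"] = counts.to_numpy()  # row order matches the (id, ts) sort
```

Sorting by `(customer_id, ts)` before the groupby-rolling is what keeps the result aligned with the original rows when assigning back.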
Deliverables
- Define “learning curve” in the context of this fraud system (what is on each axis, what curves you plot, and why).
- Build learning curves for at least two models (e.g., regularized logistic regression and gradient-boosted trees) using a time-based split.
- Interpret the curves to diagnose bias vs variance vs data limitation and propose 3 concrete interventions (e.g., feature work, regularization, more data, model change).
- Specify which metric(s) you would plot for learning curves given the class imbalance (and why accuracy is misleading).
- Provide an experiment plan: what you would try in the next week to hit the success criteria while staying within latency/training constraints.