Business Context
You’re interviewing for a Senior ML Engineer role on the Risk team at SwiftPay, a global card processor handling ~45M transactions/day across North America and Europe. Fraud is rare but costly: false negatives create direct chargeback losses and regulatory scrutiny, while false positives cause customer friction and support costs. The team needs a model that scores transactions in real time to decide whether to approve, decline, or step-up authenticate.
The class distribution is highly imbalanced and non-stationary: overall fraud prevalence is ~0.25%, but it spikes to 1–2% in certain merchant categories and during coordinated attacks. You are asked to explain the challenges of working with imbalanced datasets and to propose a production-ready modeling and evaluation plan.
Dataset
You are given 6 months of labeled transactions (labels arrive with delay due to chargeback windows).
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Transaction attributes | 18 | amount, currency, card_present, entry_mode, mcc, merchant_country | amount is heavy-tailed; some categorical drift |
| Customer behavior | 22 | tx_count_1h/24h/7d, avg_amount_30d, new_merchant_flag | aggregation windows computed at event time |
| Merchant risk | 10 | merchant_fraud_rate_7d, merchant_velocity | leakage risk if computed incorrectly (see sketch below) |
| Device/network | 14 | device_id_hash, ip_asn, geo_distance_km | missingness varies by channel |
| Auth signals | 7 | 3ds_attempted, avs_result, cvv_result | not always present |
- Size: ~120M transactions, 71 features (numerical + high-cardinality categorical)
- Target: is_fraud (1 if confirmed fraud/chargeback, else 0)
- Class balance: 0.25% positive, 99.75% negative (≈ 1:400)
- Missing data: 3DS and AVS/CVV fields missing for ~35% (channel-dependent); device_id missing for ~8%
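The merchant-risk aggregates are the main leakage hazard: merchant_fraud_rate_7d must only see transactions that precede the one being scored. A minimal pandas sketch of a point-in-time version, with hypothetical column names (merchant_id, event_time, is_fraud):

```python
import pandas as pd

def merchant_fraud_rate_7d(df: pd.DataFrame) -> pd.Series:
    """Point-in-time 7-day merchant fraud rate.

    df needs merchant_id, event_time (datetime64) and is_fraud (0/1),
    one row per transaction.
    """
    df = df.sort_values("event_time")
    out = pd.Series(index=df.index, dtype="float64")
    for _, grp in df.groupby("merchant_id", sort=False):
        s = grp.set_index("event_time")["is_fraud"].astype("float64")
        # closed="left" keeps the current transaction out of its own window,
        # so the feature never sees the label it is trying to predict
        win = s.rolling("7D", closed="left")
        out.loc[grp.index] = (win.sum() / win.count()).to_numpy()
    return out  # NaN where a merchant has no prior history in the window
```

In production the same window would additionally be restricted to frauds confirmed before event_time, since labels arrive 7–60 days late.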
Success Criteria (business-facing)
Your model is used to trigger a step-up flow (not an auto-decline) and must meet:
- Recall ≥ 70% on fraud at the chosen operating point.
- False positive rate (FPR) ≤ 0.30% overall (to protect customer experience).
- Lift ≥ 8× in the top 0.5% risk bucket, i.e. the fraud rate among the highest-scoring 0.5% of transactions must be at least 8× the overall fraud rate, so analysts can investigate effectively (see the metric sketch after this list).
- Stable performance across key slices: card_present vs card_not_present, top 10 MCCs, and top 5 countries.
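These criteria translate directly into three offline numbers computed on held-out scores. A minimal sketch (function and argument names are my own, not part of the brief) of recall and FPR at an operating threshold plus lift in the top 0.5% bucket:

```python
import numpy as np

def operating_point_metrics(y_true, scores, threshold, top_frac=0.005):
    """Recall and FPR at a score threshold, plus lift in the top `top_frac`."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    flagged = scores >= threshold
    recall = flagged[y_true == 1].mean()       # fraud caught: target >= 0.70
    fpr = flagged[y_true == 0].mean()          # good tx flagged: target <= 0.003
    k = max(1, int(top_frac * len(scores)))    # size of the top 0.5% bucket
    top = np.argsort(-scores)[:k]
    lift = y_true[top].mean() / y_true.mean()  # fraud rate in bucket vs. overall
    return recall, fpr, lift
```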
Constraints
- Latency: p99 inference < 50 ms (Python service + feature store lookup).
- Training: daily retrain allowed; full training budget < 2 hours on a single 32-core machine.
- Label delay: fraud labels can arrive 7–60 days later; you must avoid leakage (see the split sketch after this list).
- Interpretability: risk ops requires reason codes (top contributing features) for escalations.
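The label-delay constraint shapes the split before it shapes the model: the most recent transactions cannot yet be trusted as negatives. A minimal sketch of a time-ordered split that drops a label-immature tail (the event_time column and the 60-day maturity window are assumptions based on the brief):

```python
import pandas as pd

def leakage_aware_split(df: pd.DataFrame, valid_days: int = 30,
                        maturity_days: int = 60):
    """Time-ordered train/validation split that drops label-immature data."""
    # Drop the most recent window: its "non-fraud" labels may simply be
    # chargebacks that have not arrived yet (7-60 day delay).
    mature_end = df["event_time"].max() - pd.Timedelta(days=maturity_days)
    mature = df[df["event_time"] <= mature_end]
    # Validate on the latest mature slice so evaluation mimics deployment.
    valid_start = mature["event_time"].max() - pd.Timedelta(days=valid_days)
    train = mature[mature["event_time"] <= valid_start]
    valid = mature[mature["event_time"] > valid_start]
    return train, valid
```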
Deliverables (what you must produce in the interview)
- Explain the core challenges of imbalanced datasets in this setting (learning dynamics, metrics, thresholding, slice risk, label noise, drift).
- Propose an end-to-end modeling approach (baseline + improved model), including imbalance handling.
- Define an evaluation plan: split strategy, metrics, and how you will pick an operating threshold.
- Describe how you would validate the model in production (monitoring, drift, calibration, alerting).
- Provide a short implementation sketch (training + evaluation) and justify key choices; a minimal sketch follows this list.
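One way to cover the last deliverable within the CPU, latency, and interpretability constraints is a gradient-boosted tree model with class reweighting. The sketch below is illustrative, not prescriptive: it assumes LightGBM, the split and metric helpers above, and hypothetical column names.

```python
import lightgbm as lgb
import numpy as np

def train_and_pick_threshold(train, valid, target="is_fraud", fpr_budget=0.003):
    """Train a reweighted LightGBM classifier and pick a threshold that
    spends at most `fpr_budget` false-positive rate on the validation set."""
    # Assumes high-cardinality categoricals are already encoded
    # (e.g. pandas 'category' dtype, which LightGBM handles natively).
    features = [c for c in train.columns if c not in (target, "event_time")]
    pos = train[target].sum()
    neg = len(train) - pos
    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=63,
        scale_pos_weight=neg / pos,   # counteract the ~1:400 class ratio
        n_jobs=32,                    # fits the single 32-core training box
    )
    model.fit(
        train[features], train[target],
        eval_set=[(valid[features], valid[target])],
        eval_metric="average_precision",       # PR-style metric suits rare positives
        callbacks=[lgb.early_stopping(50)],
    )
    scores = model.predict_proba(valid[features])[:, 1]
    # Highest threshold whose FPR stays within the 0.30% customer-friction budget.
    threshold = np.quantile(scores[valid[target] == 0], 1 - fpr_budget)
    return model, scores, threshold
```

For the reason-code requirement, per-transaction feature attributions (e.g. SHAP values on the trained tree model) can be surfaced as the top contributing features for each escalation.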