Business Context
You’re interviewing for an ML engineering role on the Risk team at Stripe-likePay, a global payments platform processing ~25M card-not-present transactions/day across e-commerce and subscription merchants. Fraud losses and chargeback fees are material: a 5 bps increase in fraud rate can translate to $8–12M/month in direct losses and network penalties. The business wants a model that can block high-risk transactions in real time while minimizing false declines that hurt merchant conversion.
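A quick back-of-envelope check of those figures (assuming a 30-day month and that losses scale linearly with fraud count, neither of which is stated in the brief):

```python
# Sanity-check the brief's numbers: 25M txns/day, a 5 bps fraud-rate
# increase, and $8-12M/month of incremental loss.
# Assumptions (not in the source): 30-day month, linear scaling.
txns_per_day = 25_000_000
bps_increase = 5 / 10_000                             # 5 basis points as a fraction
extra_fraud_per_day = txns_per_day * bps_increase     # extra fraudulent txns/day
extra_fraud_per_month = extra_fraud_per_day * 30

# Implied average cost (direct loss + penalties) per fraudulent transaction
low = 8e6 / extra_fraud_per_month
high = 12e6 / extra_fraud_per_month
print(extra_fraud_per_day, round(low), round(high))
```

The implied per-transaction cost band is a useful cross-check when you later trade off false declines against missed fraud.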
A staff engineer asks you to go beyond the textbook definition of a decision boundary and reason about how decision boundaries behave under real production constraints: class imbalance, drifting fraud patterns, calibration, and thresholding.
Dataset
You are given an offline training dataset built from 60 days of historical transactions.
| Feature Group | Count | Examples | Notes |
|---|---|---|---|
| Transaction | 18 | amount_usd, currency, merchant_category, is_recurring, hour_of_day | Heavy-tailed amounts; strong seasonality |
| Card & account | 14 | card_age_days, account_age_days, prior_chargebacks_30d, velocity_5m | Velocity features are sparse for new users |
| Device & network | 22 | device_fingerprint_hash, ip_asn, geo_distance_km, proxy_score | Categorical high-cardinality + noisy geo |
| Merchant | 10 | merchant_risk_tier, dispute_rate_90d, avg_ticket_size | Some features only available after onboarding |
- Size: ~120M labeled transactions (train), 10M (validation), 10M (test)
- Target: is_fraud (1 if confirmed fraud/chargeback within 60 days, else 0)
- Class balance: 0.35% positive (fraud)
- Missingness: ~8% missing in device signals (ad blockers), 12% missing in merchant features for newly onboarded merchants
Success Criteria
Your model will be used to auto-decline transactions above a risk threshold and to route transactions in a middle score band to step-up authentication (3DS / OTP).
- Primary: Achieve recall ≥ 70% at precision ≥ 20% on the auto-decline segment (very imbalanced setting).
- Secondary: Maintain false decline rate ≤ 0.20% overall.
- Operational: Provide a clear explanation of how the decision boundary changes when you adjust thresholds and regularization.
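One way to check whether the primary target is attainable is to scan the precision-recall curve on held-out data for thresholds that satisfy both constraints at once. A minimal sketch with synthetic scores (in practice `y_true`/`y_score` come from the validation set; the score distributions here are invented):

```python
# Sketch: search for an auto-decline threshold meeting recall >= 0.70
# at precision >= 0.20. Synthetic data only; real scores come from the
# trained model on the 10M-row validation set.
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
n = 200_000
y_true = (rng.random(n) < 0.0035).astype(int)      # ~0.35% positives, as in the brief
y_score = rng.normal(0.0, 1.0, n) + 3.0 * y_true   # toy separable scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# precision/recall have len(thresholds)+1 entries; drop the final sentinel
ok = (precision[:-1] >= 0.20) & (recall[:-1] >= 0.70)
if ok.any():
    # Among qualifying thresholds, take the one with the highest recall
    best = np.argmax(np.where(ok, recall[:-1], -1.0))
    print("auto-decline threshold:", thresholds[best])
else:
    print("targets not jointly achievable on this data")
```

If no threshold qualifies, the constraints are infeasible for that model and the fix is a better model or renegotiated targets, not threshold tuning.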
Constraints
- Latency: p95 < 50 ms end-to-end scoring (feature retrieval + model).
- Interpretability: Risk ops needs reason codes; model must support post-hoc explanations (e.g., SHAP on a sample).
- Stability: Fraud patterns drift weekly; you must propose a monitoring plan tied to boundary movement.
- Deployment: Single model artifact, CPU-only, must handle high-cardinality categoricals safely.
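One standard way to satisfy the last constraint is the hashing trick: unseen device fingerprints or ASNs at serving time map into a fixed-size index space instead of growing the artifact or raising key errors. A sketch (the bucket count is an assumption to tune; feature names are from the brief):

```python
# Sketch: stable feature hashing for high-cardinality categoricals
# (device_fingerprint_hash, ip_asn). Uses hashlib rather than Python's
# builtin hash(), which varies across processes under hash randomization.
import hashlib

N_BUCKETS = 2 ** 20  # assumption: ~1M buckets; tune vs collision rate

def hash_bucket(feature_name: str, value: str, n_buckets: int = N_BUCKETS) -> int:
    """Deterministic bucket index, namespaced by feature name."""
    key = f"{feature_name}={value}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big") % n_buckets

b1 = hash_bucket("device_fingerprint_hash", "ab12cd")
b2 = hash_bucket("device_fingerprint_hash", "ab12cd")
print(b1, b1 == b2)  # same input -> same bucket, in any process
```

Namespacing by feature name keeps `ip_asn="123"` and `merchant_category="123"` from colliding by construction.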
Deliverables
- Explain what a decision boundary is for (a) logistic regression, (b) linear SVM, (c) gradient-boosted trees, and how it relates to thresholding and calibration.
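For part (a), one concrete fact worth demonstrating: a logistic regression's boundary at probability threshold t is the hyperplane w·x + b = log(t/(1−t)), so moving the threshold translates the boundary without rotating it. A small sketch on synthetic 2D data:

```python
# Sketch: thresholding logistic regression at t is equivalent to
# shifting the hyperplane offset to logit(t) = log(t/(1-t)).
# Synthetic data; illustrates the threshold/boundary relationship only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

for t in (0.5, 0.9):
    offset = np.log(t / (1 - t))          # logit of the threshold
    linear_rule = (X @ w + b >= offset).astype(int)
    proba_rule = (clf.predict_proba(X)[:, 1] >= t).astype(int)
    assert np.array_equal(linear_rule, proba_rule)
print("thresholding moves the hyperplane's offset, not its orientation")
```

The same logic motivates the calibration discussion: the threshold only has a business meaning ("p(fraud) ≥ t") if the probabilities are calibrated.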
- Propose an end-to-end modeling approach that explicitly addresses:
- extreme class imbalance
- missing data patterns
- high-cardinality categoricals
- drift and boundary stability
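For the imbalance bullet, class weighting is one standard lever (undersampling is another, but it distorts predicted probabilities unless you recalibrate afterward). A minimal sketch on synthetic data at roughly the brief's 0.35% positive rate:

```python
# Sketch: class weighting for extreme imbalance. class_weight="balanced"
# reweights positives by ~n_neg/n_pos (~285x at a 0.35% base rate).
# Synthetic data with an artificially clean signal, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 100_000
y = (rng.random(n) < 0.0035).astype(int)
X = rng.normal(size=(n, 4)) + 1.5 * y[:, None]   # positives shifted on all features

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

recall = (clf.predict(X[y == 1]) == 1).mean()
print(f"train recall with balanced weights: {recall:.2f}")
```

Note that weighting shifts the effective decision boundary toward the majority class's region; if you need calibrated probabilities afterward, recalibrate (e.g., isotonic or Platt scaling) on an unweighted holdout.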
- Provide a Python implementation that trains a baseline and a stronger model, plots/inspects decision boundary behavior on a 2D slice, and evaluates with fraud-appropriate metrics.
- Describe how you would choose operating thresholds for auto-decline vs step-up, and what you would monitor in production to detect boundary shift.
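For the monitoring piece, a common boundary-shift signal is the Population Stability Index (PSI) on the model's score distribution: fix bin edges on a reference window, then compare each new window against it. A sketch (the 0.2 alert level is a conventional rule of thumb, not from the brief; score distributions here are synthetic):

```python
# Sketch: PSI between a reference score distribution and a live window.
# PSI ~ 0 means stable; > ~0.2 is a conventional "investigate" level.
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch out-of-range scores
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    cur_frac = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6                                 # avoid log(0) in empty bins
    ref_frac = np.clip(ref_frac, eps, None)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(3)
ref = rng.beta(1, 40, 100_000)       # reference week's score distribution
same = rng.beta(1, 40, 100_000)      # no drift
shifted = rng.beta(1, 25, 100_000)   # scores drifting upward

print(round(psi(ref, same), 3), round(psi(ref, shifted), 3))
```

In production you would track this per feature as well as on the final score, alongside realized precision/recall at the deployed thresholds once labels mature.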