Detect Card Fraud with Imbalanced Data

Easy

Machine Learning

Asked at 9 companies

Supervised Learning, Cross-Validation

Problem

Business Context

PayLink processes roughly 8 million card transactions per day. Fraud is rare but expensive, so the risk team needs a model that catches fraudulent transactions without overwhelming manual reviewers with false positives.

Dataset

You are given a historical transaction dataset for binary classification: predict whether a transaction is fraudulent (is_fraud=1) or legitimate (is_fraud=0). The data is sampled from 6 months of production traffic.

Feature Group	Count	Examples
Transaction attributes	10	amount, merchant_category, payment_method, device_type
Customer behavior	8	transactions_24h, avg_amount_30d, chargebacks_90d, account_age_days
Velocity and risk signals	7	ip_risk_score, distance_from_home_km, failed_logins_7d, new_device_flag
Temporal/context	5	hour_of_day, day_of_week, is_weekend, country, currency

Size: 1.2M transactions, 30 features
Target: Binary — fraudulent transaction (1) vs legitimate transaction (0)
Class balance: 0.9% positive, 99.1% negative
Missing data: ~12% missing in distance_from_home_km, 7% in ip_risk_score, and sparse missingness in merchant metadata
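
To make the class balance concrete, the figures above can be turned into row counts. The snippet below is a small arithmetic sketch, not part of the original problem statement. The 5-fold split is an assumption chosen only for illustration, and the variable names are hypothetical.

```python
# Row counts implied by the stated dataset size and class balance.
N_ROWS = 1_200_000       # "1.2M transactions"
POSITIVE_RATE = 0.009    # "0.9% positive"
N_FOLDS = 5              # assumption: 5-fold stratified cross-validation

n_fraud = round(N_ROWS * POSITIVE_RATE)   # fraudulent rows
n_legit = N_ROWS - n_fraud                # legitimate rows

# A model that always predicts "legitimate" is right on every negative row
# and wrong on every positive row.
majority_accuracy = n_legit / N_ROWS
majority_recall = 0 / n_fraud

# With stratified folds, each validation fold holds about this many frauds.
fraud_per_fold = n_fraud // N_FOLDS

print(n_fraud, n_legit, majority_accuracy, majority_recall, fraud_per_fold)
```

An always-legitimate baseline reaches 99.1% accuracy with zero fraud recall, which is why accuracy alone is a misleading metric for this dataset.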

Success Criteria

A good solution should achieve recall >= 0.85 on fraud while keeping precision >= 0.25 at the operating threshold. The model should also improve ranking quality enough to support manual review queues.
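
One way to read the two targets together is as a constraint on the operating threshold: among all thresholds that reach the recall floor, prefer one that also clears the precision floor and gives the highest precision. The sketch below illustrates that selection rule on validation scores. It is only one possible rule, the function name pick_threshold is hypothetical, and the toy scores are made up for the example.

```python
from typing import Optional, Sequence, Tuple


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[int],
    min_recall: float = 0.85,
    min_precision: float = 0.25,
) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for the feasible threshold with
    the highest precision, or None if no threshold meets both floors.
    A transaction is flagged when score >= threshold."""
    total_pos = sum(labels)
    if total_pos == 0:
        return None
    # Sweep candidate thresholds from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    best = None
    i, n = 0, len(ranked)
    while i < n:
        threshold = ranked[i][0]
        # Rows tied at the same score are flagged together.
        while i < n and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        feasible = recall >= min_recall and precision >= min_precision
        if feasible and (best is None or precision > best[1]):
            best = (threshold, precision, recall)
    return best


# Toy validation set (made-up numbers, 3 frauds among 8 rows).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
print(pick_threshold(toy_scores, toy_labels))
```

The threshold should be chosen on held-out data rather than on the training set, and a None result signals that the model cannot meet both targets at any cutoff.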

Constraints

Batch scoring every 5 minutes; per-row inference should remain low-latency
False negatives are costly, but false positives create reviewer load and customer friction
The fraud team needs feature-level explanations for flagged transactions
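
The trade-off between missed fraud and reviewer load can be sized with rough arithmetic. The sketch below assumes that production traffic has the same 0.9% fraud rate as the dataset, which the problem does not state (the data is described only as sampled from production), and that the model sits exactly at the success-criteria floors. All variable names are hypothetical.

```python
# Rough daily workload at the stated recall and precision floors.
TX_PER_DAY = 8_000_000     # "roughly 8 million card transactions per day"
FRAUD_RATE = 0.009         # assumption: production rate matches the dataset
RECALL = 0.85              # recall floor from the success criteria
PRECISION = 0.25           # precision floor from the success criteria
BATCH_MINUTES = 5          # "Batch scoring every 5 minutes"

fraud_per_day = round(TX_PER_DAY * FRAUD_RATE)
caught_per_day = round(fraud_per_day * RECALL)       # true positives
missed_per_day = fraud_per_day - caught_per_day      # false negatives
flagged_per_day = round(caught_per_day / PRECISION)  # everything sent to review
false_alarms_per_day = flagged_per_day - caught_per_day

batches_per_day = 24 * 60 // BATCH_MINUTES
flagged_per_batch = flagged_per_day / batches_per_day

print(fraud_per_day, caught_per_day, missed_per_day)
print(flagged_per_day, false_alarms_per_day, flagged_per_batch)
```

Under these assumptions the review queue receives about 245,000 flags per day, three of every four being false alarms, so any precision gained above the 0.25 floor translates directly into reviewer capacity.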

Deliverables

Build and justify a modeling approach for this highly imbalanced classification problem.
Show how you would preprocess missing values and mixed feature types.
Compare at least one baseline against a stronger model.
Choose evaluation metrics appropriate for class imbalance and explain threshold selection.
Describe how you would deploy, monitor, and retrain the model in production.
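
For the missing-value deliverable, one common starting point is median imputation combined with a missingness flag, so the model can still learn from the fact that a value was absent. The sketch below shows the idea in plain Python for the two columns the problem names as having missing data. It is a minimal illustration rather than a recommended production pipeline, and the function names fit_medians and impute_with_flags are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def fit_medians(rows: List[Row], columns: List[str]) -> Dict[str, float]:
    """Learn one fill value per column from training rows only."""
    medians = {}
    for col in columns:
        observed = [r[col] for r in rows if r.get(col) is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        medians[col] = median(observed) if observed else 0.0
    return medians


def impute_with_flags(row: Row, medians: Dict[str, float]) -> Dict[str, float]:
    """Fill missing values and add a <column>_missing indicator feature."""
    out = dict(row)
    for col, fill in medians.items():
        missing = row.get(col) is None
        out[f"{col}_missing"] = 1.0 if missing else 0.0
        if missing:
            out[col] = fill
    return out


# Made-up training rows using column names from the dataset description.
train_rows = [
    {"distance_from_home_km": 2.0, "ip_risk_score": 0.1},
    {"distance_from_home_km": None, "ip_risk_score": 0.5},
    {"distance_from_home_km": 10.0, "ip_risk_score": None},
    {"distance_from_home_km": 4.0, "ip_risk_score": 0.3},
]
fitted = fit_medians(train_rows, ["distance_from_home_km", "ip_risk_score"])
new_row = impute_with_flags(
    {"distance_from_home_km": None, "ip_risk_score": 0.9}, fitted
)
print(fitted, new_row)
```

Fitting the medians on training rows only and reusing them at scoring time avoids leaking validation statistics into the model and keeps batch scoring consistent with training.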
