Detect Card Fraud with Imbalanced Data

Easy
Machine Learning
Asked at 9 companies
Supervised Learning, Cross-Validation
Also asked at: Machinify, H&R Block, Synechron, Rang Technologies, Steampunk, AAA Life Insurance

Problem

Business Context

PayLink processes roughly 8 million card transactions per day. Fraud is rare but expensive, so the risk team needs a model that catches fraudulent transactions without overwhelming manual reviewers with false positives.

Dataset

You are given a historical transaction dataset for binary classification: predict whether a transaction is fraudulent (is_fraud=1) or legitimate (is_fraud=0). The data is sampled from 6 months of production traffic.

Feature Group              | Count | Examples
Transaction attributes     | 10    | amount, merchant_category, payment_method, device_type
Customer behavior          | 8     | transactions_24h, avg_amount_30d, chargebacks_90d, account_age_days
Velocity and risk signals  | 7     | ip_risk_score, distance_from_home_km, failed_logins_7d, new_device_flag
Temporal/context           | 5     | hour_of_day, day_of_week, is_weekend, country, currency
  • Size: 1.2M transactions, 30 features
  • Target: Binary — fraudulent transaction (1) vs legitimate transaction (0)
  • Class balance: 0.9% positive, 99.1% negative
  • Missing data: ~12% missing in distance_from_home_km, 7% in ip_risk_score, and sparse missingness in merchant metadata
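
To make the imbalance concrete, the short sketch below turns the stated percentages into row counts. It treats the figures above as exact, which is an assumption since the description gives them only approximately; all variable names are illustrative.

```python
# Figures taken from the dataset description; treated as exact for illustration.
N_ROWS = 1_200_000
FRAUD_RATE = 0.009
MISSING_RATES = {"distance_from_home_km": 0.12, "ip_risk_score": 0.07}

n_fraud = round(N_ROWS * FRAUD_RATE)
n_legit = N_ROWS - n_fraud

# How many legitimate rows exist for each fraud row.
legit_per_fraud = n_legit / n_fraud

# Accuracy of a "model" that labels every transaction as legitimate.
always_legit_accuracy = n_legit / N_ROWS

# Approximate number of rows that need imputation for each feature.
missing_rows = {name: round(N_ROWS * rate) for name, rate in MISSING_RATES.items()}

print(f"fraud rows:            {n_fraud:,}")
print(f"legitimate rows:       {n_legit:,}")
print(f"legit rows per fraud:  {legit_per_fraud:.1f}")
print(f"always-legit accuracy: {always_legit_accuracy:.1%}")
for name, count in missing_rows.items():
    print(f"rows missing {name}: {count:,}")
```

With roughly 10,800 fraud rows against about 1.19M legitimate ones, a classifier that never flags anything is already 99.1% accurate, which is why accuracy alone says little about a solution to this problem.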

Success Criteria

A good solution should achieve recall >= 0.85 on the fraud class while keeping precision >= 0.25 at the operating threshold. Its scores should also rank transactions well enough to prioritize manual review queues.
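
One way to read the two floors above is as a search over score cutoffs on a held-out set. The sketch below is a minimal pure-Python version: `pick_threshold` is a hypothetical helper, the toy scores and labels are made up, and the tie-break (prefer the feasible cutoff with the highest precision, to limit reviewer load) is an assumption rather than something the problem prescribes.

```python
def pick_threshold(scores, labels, min_recall=0.85, min_precision=0.25):
    """Return (threshold, precision, recall) for the feasible cutoff with the
    highest precision, or None if no cutoff meets both floors.

    A transaction is flagged when its score is >= threshold.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return None

    # Walk down the ranking, flagging one more transaction at each step.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_pos = 0
    for i, (score, label) in enumerate(ranked):
        true_pos += label
        # Evaluate only at the end of a group of tied scores, so the
        # cutoff "score >= threshold" matches the counts exactly.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        flagged = i + 1
        precision = true_pos / flagged
        recall = true_pos / total_pos
        if recall >= min_recall and precision >= min_precision:
            if best is None or precision > best[1]:
                best = (score, precision, recall)
    return best


# Toy validation data (illustrative only): 1 = fraud, 0 = legitimate.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
print(pick_threshold(toy_scores, toy_labels))
```

If the function returns None, no single cutoff satisfies both floors on that data, which would point back to the model or features rather than to the threshold.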

Constraints

  • Batch scoring every 5 minutes; per-row inference should remain low-latency
  • False negatives are costly, but false positives create reviewer load and customer friction
  • The fraud team needs feature-level explanations for flagged transactions
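
The scoring cadence and the reviewer-load constraint can be put into rough numbers. The sketch below assumes production fraud prevalence equals the 0.9% class balance of the sample, which may not hold because the dataset is sampled, and it assumes the model sits exactly on the stated recall and precision floors. All names are illustrative.

```python
DAILY_TRANSACTIONS = 8_000_000  # from the business context
BATCH_INTERVAL_MIN = 5          # from the constraints
FRAUD_RATE = 0.009              # ASSUMPTION: production rate equals the sample's class balance
TARGET_RECALL = 0.85            # recall floor from the success criteria
TARGET_PRECISION = 0.25         # precision floor from the success criteria

# Scoring cadence.
batches_per_day = 24 * 60 // BATCH_INTERVAL_MIN
rows_per_batch = DAILY_TRANSACTIONS / batches_per_day

# Daily confusion-matrix counts at the target operating point.
daily_fraud = DAILY_TRANSACTIONS * FRAUD_RATE
caught = daily_fraud * TARGET_RECALL     # true positives
flagged = caught / TARGET_PRECISION      # everything sent to review
false_alerts = flagged - caught          # false positives
missed = daily_fraud - caught            # false negatives

print(f"batches per day:   {batches_per_day}")
print(f"rows per batch:    {rows_per_batch:,.0f}")
print(f"flagged per day:   {flagged:,.0f} ({flagged / DAILY_TRANSACTIONS:.2%} of traffic)")
print(f"false alerts/day:  {false_alerts:,.0f}")
print(f"missed fraud/day:  {missed:,.0f}")
```

Under these assumptions a precision of 0.25 means three false alerts for every caught fraud, so the review queue size is driven mostly by the precision floor.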

Deliverables

  1. Build and justify a modeling approach for this highly imbalanced classification problem.
  2. Show how you would preprocess missing values and mixed feature types.
  3. Compare at least one baseline against a stronger model.
  4. Choose evaluation metrics appropriate for class imbalance and explain threshold selection.
  5. Describe how you would deploy, monitor, and retrain the model in production.
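
For the missing-value deliverable, one common option (not the only one) is median imputation plus a 0/1 missingness flag, so the model can still use the fact that a value was absent. The sketch below is a minimal pure-Python illustration: `fit_median` and `impute_with_flag` are hypothetical helpers, the rows are made up, and only the column name `distance_from_home_km` comes from the dataset description.

```python
from statistics import median


def fit_median(rows, column):
    """Median of the observed (non-None) values of `column` in the training rows."""
    observed = [row[column] for row in rows if row[column] is not None]
    return median(observed)


def impute_with_flag(rows, column, fill_value):
    """Return copies of `rows` with `column` filled in and a
    `<column>_missing` 0/1 flag added; the input rows are not modified."""
    imputed = []
    for row in rows:
        was_missing = row[column] is None
        new_row = dict(row)
        new_row[column] = fill_value if was_missing else row[column]
        new_row[f"{column}_missing"] = int(was_missing)
        imputed.append(new_row)
    return imputed


# Toy training rows (illustrative values only).
train = [
    {"amount": 20.0, "distance_from_home_km": 1.5},
    {"amount": 950.0, "distance_from_home_km": None},
    {"amount": 35.0, "distance_from_home_km": 4.0},
    {"amount": 12.0, "distance_from_home_km": 820.0},
]

# Fit the fill value on training data only, then reuse it everywhere else.
fill = fit_median(train, "distance_from_home_km")
train_imputed = impute_with_flag(train, "distance_from_home_km", fill)
for row in train_imputed:
    print(row)
```

The fill value is computed from training rows and then applied unchanged to validation and production rows, which avoids leaking information from held-out data into preprocessing.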
