Predict Loan Default with Ensembles

Difficulty: Medium
Topic: Machine Learning
Tags: Cross-Validation, Hyperparameter Tuning, Feature Engineering
Asked at 1 company
Also asked at: Happiest Baby

Problem

Business Context

NorthBridge Finance issues unsecured personal loans to roughly 120,000 applicants per month. The risk team wants a production-ready model that predicts whether an applicant will default within 12 months so underwriting can reduce losses without rejecting too many good customers.

Dataset

You are given a historical loan-origination dataset that contains only information available at application time. Do not use post-loan repayment behavior as features.

Feature Group | Count | Examples
Applicant demographics | 6 | age, region, housing_status, dependents
Credit bureau variables | 14 | bureau_score, delinquencies_12m, credit_utilization, inquiries_6m
Income & employment | 9 | annual_income, employment_length, employer_type, income_to_debt_ratio
Loan application details | 8 | loan_amount, term_months, interest_rate, purpose, channel
Engineered temporal signals | 5 | applications_last_30d, days_since_last_inquiry, bureau_file_age

  • Rows: 420,000 funded loans from the last 24 months
  • Target: default_12m = 1 if the borrower becomes 90+ days past due within 12 months, else 0
  • Class balance: 11.4% positive, 88.6% negative
  • Missing data: 18% in employment fields, 7% in bureau variables for thin-file applicants, <2% elsewhere
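
The row count and class balance above translate directly into class counts and candidate class weights. The sketch below works through that arithmetic; the "balanced" formula is the common n_samples / (n_classes * n_c) heuristic, and the negative-to-positive ratio is the kind of value some boosting libraries accept as a positive-class weight. Whether to reweight at all is a modeling choice, not something the problem statement prescribes.

```python
# Figures taken from the dataset description.
N_ROWS = 420_000
POSITIVE_RATE = 0.114

n_pos = round(N_ROWS * POSITIVE_RATE)  # defaulted loans
n_neg = N_ROWS - n_pos                 # non-defaulted loans

# "Balanced" heuristic: weight_c = n_samples / (n_classes * n_c).
N_CLASSES = 2
w_pos = N_ROWS / (N_CLASSES * n_pos)
w_neg = N_ROWS / (N_CLASSES * n_neg)

# Negative-to-positive ratio, an alternative single-number class weight.
neg_to_pos = n_neg / n_pos

# A ranker with no skill has an expected PR-AUC close to the positive rate,
# so this is the floor against which any PR-AUC target should be read.
pr_auc_floor = POSITIVE_RATE

print(f"positives={n_pos:,} negatives={n_neg:,}")
print(f"w_pos={w_pos:.3f} w_neg={w_neg:.3f} neg_to_pos={neg_to_pos:.2f}")
print(f"no-skill PR-AUC ~ {pr_auc_floor:.3f}")
```

With roughly one positive for every eight negatives, the imbalance is moderate: reweighting mainly shifts the scale of predicted probabilities, so any threshold should be chosen after the weighting decision is fixed.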

Success Criteria

A good solution should achieve strong ranking performance and support threshold-based underwriting decisions. Target ROC-AUC >= 0.84, PR-AUC >= 0.42, and recall >= 0.70 at precision >= 0.35 on a held-out out-of-time test set.
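
The "recall >= 0.70 at precision >= 0.35" criterion can be checked by sweeping score thresholds and keeping the one with the highest recall among those that still meet the precision floor. The following is a minimal standard-library sketch of that sweep; the names `OperatingPoint` and `best_threshold` are illustrative, and the toy labels and scores are made up for demonstration only.

```python
from typing import NamedTuple, Optional, Sequence


class OperatingPoint(NamedTuple):
    threshold: float
    precision: float
    recall: float


def best_threshold(
    y_true: Sequence[int],
    scores: Sequence[float],
    min_precision: float = 0.35,
) -> Optional[OperatingPoint]:
    """Return the point with the highest recall whose precision >= min_precision.

    An application is flagged as a predicted default when score >= threshold.
    Returns None if there are no positives or no threshold meets the floor.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        return None

    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    best: Optional[OperatingPoint] = None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before evaluating.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and (best is None or recall > best.recall):
            best = OperatingPoint(threshold, precision, recall)
    return best


# Toy data, purely illustrative.
toy_y = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
point = best_threshold(toy_y, toy_scores, min_precision=0.35)
meets_target = point is not None and point.recall >= 0.70
print(point, meets_target)
```

The threshold should be chosen on a validation split and then applied unchanged to the out-of-time test set; tuning it on the test set would overstate the reported precision and recall.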

Constraints

  • Inference must complete in <50 ms per application in an online API
  • The risk team needs feature-level explanations for adverse action review
  • Retraining can happen weekly; feature generation must be reproducible in batch and online
  • The model must be robust to missing values and moderate class imbalance
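
One way to satisfy the reproducibility and missing-value constraints together is to route both the batch job and the online API through a single encoding function. The sketch below assumes a hypothetical `encode` function and a hand-picked subset of numeric fields taken from the examples column of the feature table; it keeps missing values as NaN (for models that handle them natively) and adds an explicit indicator (for models that do not).

```python
import math
from typing import Dict, Iterable, List, Mapping, Optional

# Illustrative subset of numeric fields; the real list would cover every feature.
NUMERIC_FIELDS = ("bureau_score", "employment_length", "annual_income")


def encode(raw: Mapping[str, Optional[float]]) -> Dict[str, float]:
    """Encode one application. Used unchanged by the batch and online paths."""
    encoded: Dict[str, float] = {}
    for name in NUMERIC_FIELDS:
        value = raw.get(name)
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        # Keep NaN rather than a sentinel so no fake value leaks into the model.
        encoded[name] = math.nan if is_missing else float(value)
        # Explicit indicator lets a linear model learn from missingness itself.
        encoded[f"{name}_missing"] = 1.0 if is_missing else 0.0
    return encoded


def encode_batch(rows: Iterable[Mapping[str, Optional[float]]]) -> List[Dict[str, float]]:
    """Batch path: a plain loop over the same per-row function."""
    return [encode(row) for row in rows]


# Thin-file style applicant: bureau score present, employment data absent.
application = {"bureau_score": 640, "annual_income": 52_000.0}
online_features = encode(application)
batch_features = encode_batch([application])[0]
print(online_features)
```

Because the batch path is only a loop over the per-row function, training features and serving features cannot drift apart through duplicated logic. Latency against the 50 ms budget would still need to be measured end to end on the deployed service.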

Deliverables

  1. Build a binary classification pipeline for default_12m
  2. Compare at least one linear baseline and one tree-based ensemble
  3. Explain feature engineering, leakage prevention, and validation design
  4. Select an operating threshold for underwriting based on precision/recall tradeoffs
  5. Report final test metrics and the top risk drivers
