Prevent Loan Default Model Overfitting

Easy

Machine Learning

Business Context

NorthStar Bank uses a supervised learning model to predict whether a personal loan applicant will default within 12 months. The current model performs well on training data but degrades noticeably after deployment, so the team wants a more robust approach that explicitly addresses overfitting.

Dataset

You are given a historical loan dataset used for binary classification.

Feature Group	Count	Examples
Applicant demographics	6	age, employment_length, home_ownership
Financial attributes	9	annual_income, debt_to_income, revolving_utilization
Credit history	7	fico_score, delinquencies_2y, inquiries_6m
Loan attributes	5	loan_amount, interest_rate, term_months
Engineered behavior flags	5	high_utilization_flag, recent_inquiry_ratio

Size: 120K loan applications, 32 input features
Target: Binary — defaulted within 12 months (1) vs paid as agreed (0)
Class balance: 18% positive, 82% negative
Missing data: 8% missing in employment_length, 5% in annual_income, <2% elsewhere

Success Criteria

A good solution should improve generalization, not just training accuracy. Target performance is ROC-AUC >= 0.82 on a held-out test set, with a train-test AUC gap <= 0.03 and stable cross-validation results (AUC that varies little from fold to fold).
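
The criteria above can be turned into a small automated check. The sketch below is illustrative only: `roc_auc` and `meets_criteria` are hypothetical helper names, and the `max_cv_std=0.01` threshold for "stable" is an assumption, since the prompt does not define stability numerically. The AUC is computed as the fraction of positive/negative pairs ranked correctly, which is fine for small samples but quadratic in cost.

```python
from statistics import pstdev


def roc_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. O(n_pos * n_neg), so for illustration only.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def meets_criteria(train_auc, test_auc, cv_aucs,
                   min_auc=0.82, max_gap=0.03, max_cv_std=0.01):
    """Check the three success criteria.

    min_auc and max_gap come from the prompt; max_cv_std is an assumed
    stand-in for "stable cross-validation results".
    """
    gap = train_auc - test_auc
    eps = 1e-9  # tolerance for floating-point rounding at the boundary
    return {
        "test_auc_ok": test_auc >= min_auc - eps,
        "gap_ok": gap <= max_gap + eps,
        "cv_stable": pstdev(cv_aucs) <= max_cv_std,
    }


# Toy labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# A model that clears every bar, and one that overfits badly.
good = meets_criteria(0.85, 0.83, [0.830, 0.825, 0.835])
overfit = meets_criteria(0.95, 0.83, [0.80, 0.86, 0.78])
```

Keeping the thresholds as keyword arguments makes it easy to rerun the same check at each monthly retrain.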

Constraints

The risk team requires reasonable interpretability
Batch scoring must complete in under 5 minutes for 50K applications
The solution should avoid leakage and be easy to retrain monthly
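
One way to address the leakage and batch-scoring constraints together is to keep all preprocessing inside a single scikit-learn pipeline, so imputation and scaling statistics are refit on each training fold only. The following is a minimal sketch under stated assumptions: the data is synthetic (random features with missing values injected into two columns at roughly the 8% and 5% rates from the dataset description), and the choice of median imputation with a regularized logistic regression is illustrative, not a prescribed solution.

```python
import time

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows, n_features = 5000, 32

# Synthetic stand-in for the loan data (assumption: not the real dataset).
X = rng.normal(size=(n_rows, n_features))
logit = X[:, 0] - 0.8 * X[:, 1] + 0.5 * X[:, 2] - 2.0
y = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Inject missing values into two columns, mimicking the stated rates.
X[rng.random(n_rows) < 0.08, 3] = np.nan
X[rng.random(n_rows) < 0.05, 4] = np.nan

# Imputer and scaler live inside the pipeline, so each CV fold fits them
# on its own training portion and never sees the validation rows.
pipe = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    StandardScaler(),
    LogisticRegression(C=0.1, max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: mean={cv_auc.mean():.3f}, std={cv_auc.std():.3f}")

# Time a 50K-row batch against the 5-minute scoring budget.
pipe.fit(X, y)
X_batch = rng.normal(size=(50_000, n_features))
start = time.perf_counter()
batch_scores = pipe.predict_proba(X_batch)[:, 1]
elapsed = time.perf_counter() - start
print(f"Scored {len(batch_scores)} rows in {elapsed:.2f}s")
```

Because the fitted pipeline is a single object, a monthly retrain amounts to calling `fit` again on the refreshed data and persisting the result.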

Deliverables

Build a baseline model and show evidence of overfitting
Propose and implement at least two overfitting prevention techniques
Compare a simpler regularized model against a higher-capacity model
Use cross-validation and a true holdout test set
Explain which approach you would ship and why