Reduce Loan Default Overfitting

Business Context

NorthPeak Lending, a digital consumer lender processing about 120K loan applications per month, wants a default-risk model for instant underwriting. The current model performs well offline but degrades noticeably in production, suggesting overfitting to historical training data.

Dataset

You are given a supervised binary classification dataset for predicting whether a loan will default within 12 months of origination.

Feature Group	Count	Examples
Applicant financials	12	annual_income, debt_to_income, revolving_utilization, delinquencies_2y
Credit history	9	fico_band, credit_age_months, inquiries_6m, prior_defaults
Loan attributes	8	loan_amount, term_months, interest_rate, purpose
Behavioral / derived	7	application_hour, device_risk_score, income_to_loan_ratio, recent_address_changes
Metadata	4	state, channel, employer_type, verification_status

Size: 240K historical applications, 40 modeled features
Target: default_12m — 1 if the loan defaulted within 12 months, else 0
Class balance: 14% positive, 86% negative
Missing data: 6-18% missing in income verification and employment fields; sparse missingness in credit bureau features

Success Criteria

A good solution should generalize to unseen data rather than only maximize training accuracy. Targets: test AUC-ROC >= 0.78, test F1 >= 0.45, and a train-test AUC gap below 0.04 after applying regularization and model selection controls.
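The three thresholds above can be checked mechanically once a model has been scored. The following is a minimal sketch; the `EvalResult` container and the `meets_success_criteria` helper are hypothetical names introduced for illustration, and the gap is assumed to mean train AUC minus test AUC.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Scores for one fitted model (hypothetical container)."""
    train_auc: float
    test_auc: float
    test_f1: float


def meets_success_criteria(r, min_auc=0.78, min_f1=0.45, max_gap=0.04):
    """Compare scores against the thresholds stated in the success criteria."""
    gap = r.train_auc - r.test_auc  # assumed definition of the train-test AUC gap
    return {
        "auc_ok": r.test_auc >= min_auc,
        "f1_ok": r.test_f1 >= min_f1,
        "gap_ok": gap < max_gap,
        "auc_gap": gap,
    }


# Illustrative numbers only, not results from the NorthPeak data.
overfit = meets_success_criteria(EvalResult(train_auc=0.85, test_auc=0.79, test_f1=0.46))
healthy = meets_success_criteria(EvalResult(train_auc=0.81, test_auc=0.79, test_f1=0.46))
```

Here the first model clears the AUC and F1 bars but fails the gap check, which is the pattern the problem statement describes as overfitting.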

Constraints

Predictions must be returned in <50 ms per application
Risk and compliance teams require reasonable interpretability
Retraining happens monthly; feature engineering should remain maintainable
Avoid leakage from post-origination variables
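One way to respect the leakage constraint is to strip any field that is only known after origination and to evaluate on loans originated later than the training window. The sketch below assumes each record carries an `origination_date`, and the field names in `POST_ORIGINATION_FIELDS` are hypothetical examples; neither appears in the feature list above.

```python
from datetime import date

# Hypothetical post-origination fields; replace with the real list from a data audit.
POST_ORIGINATION_FIELDS = {"payments_missed_to_date", "collections_flag", "charged_off_amount"}


def drop_leaky_fields(record):
    """Return a copy of the record without fields unknown at decision time."""
    return {k: v for k, v in record.items() if k not in POST_ORIGINATION_FIELDS}


def out_of_time_split(records, cutoff, date_key="origination_date"):
    """Train on loans originated before the cutoff, test on the rest."""
    train = [r for r in records if r[date_key] < cutoff]
    test = [r for r in records if r[date_key] >= cutoff]
    return train, test


records = [
    {"origination_date": date(2023, 1, 15), "loan_amount": 5000, "collections_flag": 1},
    {"origination_date": date(2023, 6, 1), "loan_amount": 12000, "collections_flag": 0},
]
clean = [drop_leaky_fields(r) for r in records]
train, test = out_of_time_split(clean, cutoff=date(2023, 4, 1))
```

An out-of-time split is also a closer proxy for the monthly retraining cycle than a random split, since the model is always scored on applications newer than its training data.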

Deliverables

Explain the main causes of overfitting in this dataset and training setup.
Build a baseline model and at least one regularized model.
Compare train vs validation/test performance and show how regularization changes the gap.
Use cross-validation to tune regularization strength.
Recommend a production-ready approach and justify the tradeoffs.
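The baseline, regularized model, and cross-validated tuning steps listed above could be sketched as follows with scikit-learn. This is an illustration under stated assumptions, not a reference solution: the data is synthetic (`make_classification` with 40 features and roughly 14% positives), logistic regression is chosen here only because it is easy to interpret, the grid of `C` values is arbitrary, and the random split stands in for the out-of-time split that real loan data would call for.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data: 40 features, about 14% positives.
X, y = make_classification(
    n_samples=4000, n_features=40, n_informative=8,
    weights=[0.86, 0.14], flip_y=0.05, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def auc_gap(model):
    """Return (train AUC, test AUC, train minus test) for a fitted model."""
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return train_auc, test_auc, train_auc - test_auc


# Baseline: a very large C makes the L2 penalty negligible.
baseline = make_pipeline(StandardScaler(), LogisticRegression(C=1e6, max_iter=2000))
baseline.fit(X_train, y_train)

# Regularized model: tune C (inverse regularization strength) with stratified 5-fold CV.
grid = [0.001, 0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    param_grid={"logisticregression__C": grid},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)
best_C = search.best_params_["logisticregression__C"]

results = {
    "baseline": auc_gap(baseline),
    "regularized": auc_gap(search.best_estimator_),
}
for name, (tr, te, gap) in results.items():
    print(f"{name}: train AUC={tr:.3f} test AUC={te:.3f} gap={gap:.3f}")
```

Smaller `C` means stronger regularization in scikit-learn. On the real dataset the same comparison would be repeated for any more flexible candidate (for example a tree ensemble), where the gap between the two rows is usually more pronounced than for a linear model.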
