NorthPeak Lending, a digital consumer lender processing about 120K loan applications per month, wants a default-risk model for instant underwriting. The current model performs well offline but degrades noticeably in production, suggesting overfitting to historical training data.
You are given a supervised binary classification dataset for predicting whether a loan will default within 12 months of origination.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant financials | 12 | annual_income, debt_to_income, revolving_utilization, delinquencies_2y |
| Credit history | 9 | fico_band, credit_age_months, inquiries_6m, prior_defaults |
| Loan attributes | 8 | loan_amount, term_months, interest_rate, purpose |
| Behavioral / derived | 7 | application_hour, device_risk_score, income_to_loan_ratio, recent_address_changes |
| Metadata | 4 | state, channel, employer_type, verification_status |

Target: `default_12m` (1 if the loan defaulted within 12 months of origination, else 0).

A good solution should improve generalization on unseen data, not just training accuracy. Target test AUC-ROC >= 0.78, test F1 >= 0.45, and keep the train-test AUC gap below 0.04 after applying regularization and model selection controls.
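
The acceptance criteria above can be checked mechanically once a candidate model has produced scores. The following is a minimal sketch in plain Python, not a prescribed implementation: the function names (`auc_roc`, `f1_at_threshold`, `meets_targets`) and the 0.5 decision threshold are assumptions made for illustration, and the pairwise AUC is quadratic in the number of rows, so a library routine would be preferable on the full dataset.

```python
from typing import Dict, Sequence


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC as the probability that a random positive outranks a random
    negative, counting ties as half a win. O(n_pos * n_neg); illustrative only."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float],
                    threshold: float = 0.5) -> float:
    """F1 for the positive (default) class at a fixed score threshold.
    The 0.5 default is an assumption; in practice it would be tuned on validation data."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def meets_targets(train_auc: float, test_auc: float, test_f1: float,
                  min_auc: float = 0.78, min_f1: float = 0.45,
                  max_gap: float = 0.04) -> Dict[str, bool]:
    """Compare metrics against the stated targets: test AUC-ROC >= 0.78,
    test F1 >= 0.45, and train-test AUC gap strictly below 0.04."""
    gap = train_auc - test_auc
    return {
        "test_auc": test_auc >= min_auc,
        "test_f1": test_f1 >= min_f1,
        "auc_gap": gap < max_gap,
    }


if __name__ == "__main__":
    # Toy labels and scores, purely illustrative (not NorthPeak data).
    y = [0, 0, 1, 1]
    s = [0.1, 0.4, 0.35, 0.8]
    print(auc_roc(y, s), f1_at_threshold(y, s))
    print(meets_targets(train_auc=0.81, test_auc=0.79, test_f1=0.47))
```

Reporting the three checks separately, rather than as a single pass/fail flag, makes it easier to see whether a candidate fails on discrimination, on the thresholded decision, or on the overfitting gap.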