LendWise, a mid-size digital lender processing roughly 200K personal loan applications per quarter, wants a default-risk model that is accurate, stable, and explainable to credit analysts. The current model uses nearly every available column, causing leakage risk, unstable coefficients, and poor generalization.
You are given an offline training dataset of historical approved loans and their repayment outcomes. The task is to design a feature selection strategy and train a binary classifier that predicts whether a borrower will default within 12 months of origination.

The 50 candidate features fall into five groups:

| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 8 | age, employment_status, education_level, region |
| Credit bureau variables | 14 | fico_score, delinquencies_12m, credit_utilization, inquiries_6m |
| Financial variables | 11 | annual_income, debt_to_income, monthly_obligations, savings_balance |
| Loan attributes | 7 | loan_amount, term_months, interest_rate, purpose |
| Derived / noisy variables | 10 | application_channel_score, partner_flag, duplicate bureau aggregates |
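
The "duplicate bureau aggregates" in the last group suggest that a redundancy screen could be one early step of the feature selection strategy. The following is a minimal, dependency-free sketch of such a screen, not a prescribed method: it greedily drops any column whose absolute Pearson correlation with an already-kept column reaches a threshold. The function names, the `0.95` threshold, and the `bureau_delinq_total` column are hypothetical and chosen only for illustration. Note that this addresses redundancy only; leakage has to be checked separately against what is known at origination.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_near_duplicates(columns, threshold=0.95):
    """Keep a column only if |r| with every already-kept column is below threshold.

    `columns` maps feature name -> list of numeric values. Insertion order
    sets priority, so list the features analysts prefer to keep first.
    Returns (kept names, {dropped name: kept name it duplicates}).
    """
    kept = []
    dropped = {}
    for name, values in columns.items():
        clash = next(
            (k for k in kept if abs(pearson(values, columns[k])) >= threshold),
            None,
        )
        if clash is None:
            kept.append(name)
        else:
            dropped[name] = clash
    return kept, dropped


# Toy data, invented for illustration only.
columns = {
    "delinquencies_12m": [0, 1, 0, 3, 2, 0, 1, 4],
    "credit_utilization": [0.20, 0.55, 0.10, 0.90, 0.35, 0.60, 0.15, 0.70],
    # Hypothetical aggregate: exactly 2x delinquencies_12m.
    "bureau_delinq_total": [0, 2, 0, 6, 4, 0, 2, 8],
}
kept, dropped = drop_near_duplicates(columns)
print(kept)
print(dropped)
```

Recording which kept feature each dropped column duplicates gives credit analysts an audit trail for why a variable was excluded.
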
The target variable is `default_12m` (1 = defaulted within 12 months, 0 = did not default).

A strong solution should improve generalization over a model trained on all raw features, maintain interpretable drivers for risk review, and achieve ROC-AUC >= 0.80 with PR-AUC >= 0.42 on a held-out test set.
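
The two metric thresholds can be turned into an explicit acceptance check on the held-out test set. The sketch below is a dependency-free illustration under stated assumptions: PR-AUC is approximated by average precision (the brief does not say how PR-AUC should be computed), ROC-AUC is computed pairwise (quadratic in sample size, so suitable only for small samples), and the function names and toy scores are hypothetical. In practice, scikit-learn's `roc_auc_score` and `average_precision_score` would be the usual choice.

```python
def roc_auc(y_true, scores):
    """Probability that a random defaulter outscores a random non-defaulter.

    Ties count as half a win. Pairwise, so O(n_pos * n_neg).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision-at-rank taken at each true defaulter.

    Tied scores keep their input order (Python's sort is stable).
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("no positive labels")
    return total / hits


def meets_targets(y_true, scores, min_roc_auc=0.80, min_pr_auc=0.42):
    """Compare test-set metrics against the thresholds stated in the brief."""
    roc = roc_auc(y_true, scores)
    ap = average_precision(y_true, scores)
    return {
        "roc_auc": roc,
        "pr_auc": ap,
        "passed": roc >= min_roc_auc and ap >= min_pr_auc,
    }


# Toy labels and predicted default probabilities, invented for illustration.
y_test = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
p_test = [0.05, 0.10, 0.80, 0.20, 0.35, 0.40, 0.15, 0.90, 0.30, 0.25]
report = meets_targets(y_test, p_test)
print(report)
```

Running the same check on a baseline trained on all raw features makes the "improve generalization" requirement concrete: the selected-feature model should pass the thresholds and match or beat the baseline on both metrics.
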