LendWise, a mid-size digital lender processing ~300K consumer loan applications per year, wants a simpler and more stable default-risk model. The current model uses every available field, which has increased training time, reduced interpretability, and introduced noisy predictors.
You are given a historical loan application dataset for a binary classification problem: predict whether an approved loan will default within 12 months.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 8 | age, employment_length, home_ownership, marital_status |
| Credit bureau variables | 14 | fico_score, revolving_utilization, delinquencies_2y, inquiries_6m |
| Financial variables | 10 | annual_income, debt_to_income, loan_amount, installment |
| Application metadata | 6 | channel, state, application_hour, repeat_customer |
| Engineered candidates | 12 | income_to_loan_ratio, utilization_bucket, recent_inquiry_rate |

Target: `default_12m` (1 = defaulted within 12 months, 0 = repaid/current)

A good solution should reduce the feature set meaningfully while maintaining predictive quality. Target performance is ROC-AUC ≥ 0.78 and F1 ≥ 0.50 on the holdout set, with no more than a 2-point ROC-AUC drop versus a full-feature baseline.
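
The acceptance criteria can be written down as a small check, which helps keep the comparison between the full-feature baseline and a reduced model explicit. The following is a minimal sketch, not a prescribed implementation. It assumes the stated ROC-AUC and F1 targets are minimums and that a "2-point drop" means 0.02 in ROC-AUC. The names `HoldoutMetrics` and `meets_targets` are hypothetical, and the metric values in the usage lines are made up for illustration.

```python
from dataclasses import dataclass

# Thresholds from the problem statement. Treating them as minimums and
# reading "2-point drop" as 0.02 ROC-AUC are assumptions.
MIN_AUC = 0.78
MIN_F1 = 0.50
MAX_AUC_DROP = 0.02
TOLERANCE = 1e-9  # guards against floating-point noise at the boundary


@dataclass
class HoldoutMetrics:
    """Holdout-set metrics for one model (hypothetical container)."""
    roc_auc: float
    f1: float


def meets_targets(full: HoldoutMetrics, reduced: HoldoutMetrics) -> bool:
    """Return True if the reduced-feature model satisfies all three criteria."""
    auc_drop = full.roc_auc - reduced.roc_auc
    return (
        reduced.roc_auc >= MIN_AUC
        and reduced.f1 >= MIN_F1
        and auc_drop <= MAX_AUC_DROP + TOLERANCE
    )


# Illustrative values only; these are not results from the dataset.
baseline = HoldoutMetrics(roc_auc=0.80, f1=0.52)
candidate = HoldoutMetrics(roc_auc=0.79, f1=0.51)
passed = meets_targets(baseline, candidate)
```

Here `passed` is true because the candidate clears both minimums and loses only 0.01 ROC-AUC against the baseline. A candidate at 0.785 ROC-AUC against a 0.82 baseline would fail on the drop criterion even though it clears the absolute minimum.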