LendWise, a mid-size digital lender processing roughly 300K consumer loan applications per year, wants a default-risk model for underwriting. The current model uses nearly every available column, causing unstable performance, weak interpretability, and avoidable retraining issues when upstream fields change.
You are given a historical loan application dataset and asked to design a feature selection strategy for a binary classification model that predicts whether an approved loan will default within 12 months. The 45 candidate features fall into five groups:

| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 8 | age, region, employment_type, years_at_job |
| Credit bureau signals | 12 | credit_score, delinquencies_12m, utilization_rate, open_accounts |
| Financials | 10 | annual_income, debt_to_income, monthly_obligations, savings_balance |
| Application metadata | 6 | channel, device_type, application_hour, repeat_customer |
| Engineered candidates | 9 | income_per_open_account, recent_delinquency_flag, utilization_bucket |
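
To make the size of the pruning task concrete, the group counts in the table can be tallied against the 20-feature cap from the requirements. This is only a minimal sketch: the names `FEATURE_GROUP_COUNTS` and `MAX_SELECTED_FEATURES` are illustrative and not part of the original dataset or any existing codebase.

```python
# Candidate feature counts per group, copied from the table above.
# The dictionary and constant names are hypothetical, chosen for illustration.
FEATURE_GROUP_COUNTS = {
    "Applicant demographics": 8,
    "Credit bureau signals": 12,
    "Financials": 10,
    "Application metadata": 6,
    "Engineered candidates": 9,
}

MAX_SELECTED_FEATURES = 20  # cap stated in the problem requirements

# Total number of candidates across all groups.
total_candidates = sum(FEATURE_GROUP_COUNTS.values())

# Minimum number of features any valid solution has to drop.
min_to_drop = max(0, total_candidates - MAX_SELECTED_FEATURES)

print(f"candidates={total_candidates}, must drop at least {min_to_drop}")
```

Under these counts, any acceptable solution removes more than half of the candidate columns, so the selection strategy needs a documented reason for each removal rather than an ad hoc cut.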
A good solution should improve generalization while reducing the feature set to a maintainable subset. Target performance is ROC-AUC ≥ 0.80 and PR-AUC ≥ 0.42 on a held-out test set, with no more than 20 selected features and a clear explanation of why each feature was kept or removed.
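
The acceptance criteria above can be checked mechanically once a candidate model has produced scores on the held-out set. The following is a hedged, standard-library-only sketch, not a prescribed implementation. It assumes labels are encoded as 0 (repaid) and 1 (defaulted), and it assumes PR-AUC is measured as average precision, which is one common convention but is not specified in the problem statement. The function names `roc_auc`, `average_precision`, and `meets_targets` are hypothetical.

```python
from itertools import groupby


def roc_auc(y_true, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity; ties share an average rank."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Average precision: precision at each distinct threshold, weighted by the recall gained."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive label")
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    # Tied scores are processed together as a single threshold.
    for _, group in groupby(pairs, key=lambda p: p[0]):
        labels = [y for _, y in group]
        pos_here = sum(labels)
        tp += pos_here
        fp += len(labels) - pos_here
        ap += (pos_here / n_pos) * (tp / (tp + fp))
    return ap


def meets_targets(y_true, scores, n_features,
                  min_roc_auc=0.80, min_pr_auc=0.42, max_features=20):
    """Return one pass/fail flag per acceptance criterion from the problem statement."""
    return {
        "roc_auc": roc_auc(y_true, scores) >= min_roc_auc,
        "pr_auc": average_precision(y_true, scores) >= min_pr_auc,
        "feature_cap": n_features <= max_features,
    }
```

A call such as `meets_targets(y_test, model_scores, n_features=18)` returns a dictionary of booleans, where `y_test` and `model_scores` stand in for the held-out labels and predicted default probabilities. Reporting the three checks separately makes it easier to see whether a failing candidate missed on ranking quality, on precision for the rare default class, or on feature count.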