Business Context
NorthStar Bank wants to improve a loan default model used for pre-approval decisions on ~250K consumer loan applications per month. The current model has acceptable AUC but is difficult to maintain because it uses too many weak, redundant, and unstable features.
Dataset
You are given a historical tabular dataset of loan applications and 12-month repayment outcomes. The goal is to design a feature selection strategy that improves generalization, preserves interpretability for risk analysts, and avoids leakage.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 8 | age, employment_status, residential_status |
| Financial attributes | 14 | annual_income, debt_to_income, revolving_utilization |
| Credit bureau variables | 19 | delinquency_count_12m, inquiries_6m, oldest_trade_age |
| Application metadata | 7 | channel, product_type, requested_amount |
| Engineered aggregates | 12 | income_per_open_trade, utilization_trend_3m |
- Size: 420K applications, 60 candidate features
- Target: Binary — default within 12 months
- Class balance: 11.5% default, 88.5% non-default
- Missing data: 18% missing in bureau variables for thin-file applicants; 6% missing in income-related fields
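
Because the bureau missingness is concentrated in thin-file applicants, the fact that a value is missing may itself carry signal. One possible way to preserve that signal before imputing is an explicit missing flag per column. The sketch below is only an illustration: the helper names, the three-column subset, and the toy rows are assumptions, not part of the provided dataset.

```python
# Hypothetical subset of the 19 bureau columns, taken from the examples above.
BUREAU_COLS = ["delinquency_count_12m", "inquiries_6m", "oldest_trade_age"]


def add_missing_flags(row: dict, cols: list[str]) -> dict:
    """Return a copy of `row` with a 0/1 `<col>_missing` flag per column."""
    out = dict(row)
    for col in cols:
        out[f"{col}_missing"] = 1 if row.get(col) is None else 0
    return out


def missing_rate(rows: list[dict], col: str) -> float:
    """Share of rows where `col` is absent or None."""
    return sum(r.get(col) is None for r in rows) / len(rows)


# Toy rows with made-up values; the second applicant is thin-file.
rows = [
    {"delinquency_count_12m": 0, "inquiries_6m": 2, "oldest_trade_age": 84},
    {"delinquency_count_12m": None, "inquiries_6m": None, "oldest_trade_age": None},
]
flagged = [add_missing_flags(r, BUREAU_COLS) for r in rows]
```

If the bureau fields tend to be missing together, a single thin-file flag may be easier to defend to risk analysts than one flag per column; that trade-off would need to be checked against the actual data.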
Success Criteria
A strong solution should:
- Improve validation performance over a regularized logistic regression baseline
- Reduce the feature set to a smaller, defensible subset without materially hurting recall
- Provide a repeatable selection process that risk and compliance teams can review
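
The recall criterion can be turned into an explicit acceptance check so that the comparison is repeatable. The sketch below assumes "materially" means an absolute recall drop of more than 0.01 at a fixed decision threshold; that tolerance, the function names, and the toy predictions are all assumptions that the bank would need to set for itself.

```python
def recall(y_true: list[int], y_pred: list[int]) -> float:
    """Recall on the positive (default) class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def subset_is_acceptable(base_recall: float, subset_recall: float,
                         max_drop: float = 0.01) -> bool:
    """True if the reduced model loses at most `max_drop` absolute recall."""
    return (base_recall - subset_recall) <= max_drop


# Toy labels and predictions (made up); 1 = default.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
pred_full = [1, 1, 1, 0, 0, 0, 1, 0]    # model using all candidate features
pred_subset = [1, 1, 1, 0, 0, 0, 0, 0]  # model using the reduced feature set

r_full = recall(y_true, pred_full)
r_subset = recall(y_true, pred_subset)
ok = subset_is_acceptable(r_full, r_subset)
```

In practice the same check would be run on the validation set for each candidate subset, alongside a threshold-free metric such as AUC.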
Constraints
- Final model must remain interpretable enough for adverse action reasoning
- Batch scoring latency must stay under 50 ms per application
- Features unavailable at application time cannot be used
- The bank prefers a stable feature set retrained quarterly, not weekly
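
The application-time constraint is easiest to enforce if every candidate feature is tagged with the stage at which its value becomes known. The sketch below assumes such a catalog exists; the `FEATURE_STAGE` mapping, the stage labels, and the `days_past_due_first_payment` field are hypothetical and only illustrate the check.

```python
# Hypothetical catalog: feature name -> stage at which its value is known.
FEATURE_STAGE = {
    "annual_income": "application",
    "inquiries_6m": "application",
    "requested_amount": "application",
    "days_past_due_first_payment": "post_origination",  # hypothetical leaky field
}


def split_by_availability(catalog: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (usable, rejected); usable means known at application time."""
    usable = sorted(f for f, s in catalog.items() if s == "application")
    rejected = sorted(f for f, s in catalog.items() if s != "application")
    return usable, rejected


usable, rejected = split_by_availability(FEATURE_STAGE)
```

A metadata check like this only catches leakage that is documented in the catalog. Features whose values are backfilled or revised after the decision date would still need a separate point-in-time audit.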
Deliverables
- Define a feature selection framework, including leakage checks and handling of correlated variables.
- Build a baseline and at least one selected-feature model.
- Compare filter, embedded, or wrapper-style selection methods and justify the final choice.
- Report evaluation metrics on a held-out test set.
- Summarize the final selected features and explain why they were retained or removed.
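
One simple filter-style way to handle correlated variables is a greedy pass: rank features by absolute correlation with the target, then keep a feature only if it is not too correlated with anything already kept. The sketch below is a minimal illustration under stated assumptions: the 0.8 threshold, the function names, and the toy values are made up, and Pearson correlation is used only because it needs no external libraries.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def prune_correlated(features: dict[str, list[float]], target: list[float],
                     threshold: float = 0.8) -> list[str]:
    """Greedy filter: rank by |corr with target|, then keep a feature only if
    its |corr| with every already-kept feature is below `threshold`."""
    ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)),
                    reverse=True)
    kept: list[str] = []
    for f in ranked:
        if all(abs(pearson(features[f], features[k])) < threshold for k in kept):
            kept.append(f)
    return kept


# Toy data (made up): the two utilization columns are near-duplicates.
data = {
    "revolving_utilization": [0.1, 0.4, 0.5, 0.9, 0.3, 0.8],
    "utilization_trend_3m":  [0.2, 0.5, 0.6, 1.0, 0.4, 0.9],
    "inquiries_6m":          [3.0, 1.0, 0.0, 2.0, 1.0, 4.0],
}
default = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
kept = prune_correlated(data, default)
```

On the toy data, only one of the two utilization columns survives, and `inquiries_6m` is retained because it is weakly correlated with both. Pearson correlation captures linear relationships only, so a real comparison would likely also consider rank correlation or an embedded method such as L1-regularized logistic regression, with the choice justified on validation results.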