LendWise, a mid-size digital lender processing about 120K consumer loan applications per month, wants a default-risk model that is accurate, explainable, and stable in production. The current challenge is not only training a classifier, but selecting a feature set that improves generalization while remaining interpretable for risk and compliance teams.
You are given a historical loan application dataset for binary classification: predict whether an approved loan will default within 12 months. Its 44 candidate features fall into five groups:

| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 6 | age, region, employment_status |
| Credit bureau variables | 12 | credit_score, delinquency_count_12m, utilization_ratio |
| Financial variables | 10 | annual_income, debt_to_income, monthly_obligations |
| Application metadata | 7 | channel, device_type, application_hour |
| Engineered behavior features | 9 | income_to_loan_ratio, recent_inquiry_rate, credit_age_bucket |

Target: default_12m (1 = defaulted within 12 months, 0 = repaid/no default).

A good solution should improve validation performance over a simple all-features logistic regression baseline while reducing unnecessary or redundant features. The final feature set should remain explainable enough for model risk review.
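
One way to make the comparison against the all-features baseline concrete is sketched below. It is only an illustration under stated assumptions: the data is synthetic (generated with scikit-learn's `make_classification`, not the LendWise dataset), all 44 columns are treated as numeric, ROC AUC is assumed as the validation metric, and L1-regularized selection is just one candidate technique, not a prescribed method. The regularization strength `C=0.05` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption): 44 numeric columns,
# some informative, some redundant, with roughly 10% positives (defaults).
X, y = make_classification(
    n_samples=5000,
    n_features=44,
    n_informative=10,
    n_redundant=10,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],
    random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: logistic regression on all features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
baseline_auc = roc_auc_score(y_val, baseline.predict_proba(X_val)[:, 1])

# Candidate: an L1-penalized logistic regression picks a subset of features,
# then a plain logistic regression is refit on the surviving columns.
selector = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
)
selected = make_pipeline(
    StandardScaler(), selector, LogisticRegression(max_iter=1000)
)
selected.fit(X_train, y_train)
selected_auc = roc_auc_score(y_val, selected.predict_proba(X_val)[:, 1])

# Number of features that survived the L1 selection step.
n_kept = int(selected.named_steps["selectfrommodel"].get_support().sum())

print(f"baseline AUC (44 features): {baseline_auc:.3f}")
print(f"selected AUC ({n_kept} features): {selected_auc:.3f}")
```

Putting the selector inside the pipeline means it is fit on the training split only, so the validation score is not contaminated by feature selection. On the real dataset, categorical columns such as region or channel would need encoding first, and the kept features would still need review for interpretability.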