LendWise, a mid-size digital lender processing ~120K personal loan applications per month, wants a default-risk model for pre-approval decisions. The current rule-based system drops rows with null values, reducing coverage and potentially biasing decisions.
You are given a historical application dataset for binary classification: predict whether an applicant will default within 12 months of origination.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 6 | age, employment_status, education_level, region |
| Financial profile | 9 | annual_income, debt_to_income, revolving_utilization, existing_loans |
| Credit bureau | 7 | credit_score, delinquency_count_12m, inquiries_6m, oldest_trade_age |
| Application metadata | 5 | channel, requested_amount, loan_term, application_hour |
| Derived flags | 4 | income_missing_flag, bureau_missing_flag, self_employed_flag |
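The derived flags in the last group can be built directly from nullness of the source fields. A minimal sketch, assuming the applications are loaded into a pandas DataFrame `df` (the toy frame below is a hypothetical stand-in; column names follow the feature table above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the application data.
df = pd.DataFrame({
    "annual_income": [52_000.0, np.nan, 71_000.0, np.nan],
    "credit_score": [680.0, 712.0, np.nan, 655.0],
})

# Derived flags: 1 where the source field is null, 0 otherwise.
df["income_missing_flag"] = df["annual_income"].isna().astype(int)
df["bureau_missing_flag"] = df["credit_score"].isna().astype(int)

print(df[["income_missing_flag", "bureau_missing_flag"]].to_dict("list"))
```

Keeping these flags as model inputs lets the classifier learn whether missingness itself is predictive of default, which row-dropping discards entirely.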
**Target:** `default_12m` (1 = default, 0 = non-default)

**Missingness:**

- `annual_income`: 18% missing
- `credit_score`: 11% missing
- `employment_status`: 6% missing
- `debt_to_income`: 9% missing

**Success criteria:** A good solution should improve over both the row-dropping and simple mean-imputation baselines, achieve ROC-AUC >= 0.78, and maintain stable performance across applicants with and without missing bureau data.