Business Context
LendWise, a mid-size digital lender processing about 120K personal loan applications per month, wants a model that predicts 90-day default risk at application time. The core interview task is not only to train a classifier but also to justify which model family best fits strict constraints on interpretability, training speed, and batch-scoring throughput.
Dataset
| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 6 | age, employment_status, region, dependents |
| Credit history | 9 | credit_score, delinquencies_12m, utilization_rate, open_accounts |
| Income & affordability | 7 | annual_income, debt_to_income, housing_cost, verified_income_flag |
| Loan request | 5 | loan_amount, term_months, purpose, interest_rate_offer |
| Behavioral/application | 5 | application_channel, device_type, time_on_form, prior_application_count |
- Size: 420K historical applications, 32 input features
- Target: Binary (default within 90 days of origination)
- Class balance: 11.4% default, 88.6% non-default
- Missing data: 8% missing in income verification fields, 3% missing in credit bureau attributes, 12% missing in behavioral fields for branch-originated applications
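The behavioral fields are missing mostly for branch-originated applications, so the missingness itself may carry signal. One common option is to add an explicit 0/1 flag per field before any imputation. The sketch below is only an illustration: it assumes records arrive as dictionaries with `None` marking a missing value, and the helper names `add_missing_flags` and `missing_rate`, the `_missing` suffix, and the sample rows are all hypothetical.

```python
def add_missing_flags(records, fields):
    """Return copies of the records with a 0/1 `<field>_missing` flag per field."""
    flagged = []
    for rec in records:
        out = dict(rec)  # copy so the input rows are left untouched
        for field in fields:
            out[field + "_missing"] = 1 if rec.get(field) is None else 0
        flagged.append(out)
    return flagged


def missing_rate(records, field):
    """Share of records where `field` is absent or None (0.0 for an empty input)."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field) is None) / len(records)


# Made-up rows for illustration only; not taken from the LendWise data.
apps = [
    {"application_channel": "branch", "time_on_form": None, "credit_score": 690},
    {"application_channel": "web", "time_on_form": 312.0, "credit_score": None},
    {"application_channel": "web", "time_on_form": 145.5, "credit_score": 720},
    {"application_channel": "branch", "time_on_form": None, "credit_score": 655},
]
flagged_apps = add_missing_flags(apps, ["time_on_form", "credit_score"])
```

Comparing `missing_rate` per `application_channel` is a quick way to confirm whether a gap is structural (tied to the channel) rather than random, which affects whether a flag, an imputed value, or native missing-value handling in the model is the better choice.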
Success Criteria
A solution is good enough if it ranks applicants better than the current scorecard (for example, a higher ROC AUC or PR AUC on held-out data), achieves strong recall on defaulters without flagging more applications than underwriting can review, and remains explainable enough for risk and compliance review.
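The trade-off between recall on defaulters and underwriting workload can be made concrete by choosing the decision threshold under a review-capacity cap. The sketch below assumes that an application is flagged when its score is at or above the threshold and that underwriting capacity is expressed as a maximum share of applications flagged; the function name `select_threshold`, the `max_flag_rate` parameter, and the sample scores are hypothetical.

```python
def select_threshold(scores, labels, max_flag_rate):
    """Lowest threshold whose flagged share stays within max_flag_rate.

    Flagging rule: score >= threshold. Lowering the threshold can only raise
    recall, so the lowest feasible threshold maximizes recall under the cap.
    Returns (threshold, recall, precision), or None if no threshold fits.
    """
    n = len(scores)
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    best = None
    flagged = 0
    true_pos = 0
    i = 0
    while i < n:
        threshold = ranked[i][0]
        # Applications with tied scores are flagged together.
        j = i
        while j < n and ranked[j][0] == threshold:
            flagged += 1
            true_pos += ranked[j][1]
            j += 1
        if flagged / n > max_flag_rate:
            break
        recall = true_pos / total_pos if total_pos else 0.0
        best = (threshold, recall, true_pos / flagged)
        i = j
    return best


# Made-up validation scores and labels (1 = default) for illustration only.
val_scores = [0.92, 0.81, 0.77, 0.64, 0.40, 0.35, 0.22, 0.15, 0.09, 0.05]
val_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
choice = select_threshold(val_scores, val_labels, max_flag_rate=0.4)
```

On this toy data, a 40% cap lets the four highest-scoring applications be flagged, which captures all three defaulters. In practice the threshold would be chosen on a validation split and confirmed on a separate test split.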
Constraints
- Batch scoring of 120K applications must complete in under 10 minutes
- The risk team requires feature-level explanations of model predictions
- Training and retraining should be feasible weekly on a standard CPU machine
- The chosen tool should be robust to moderate missingness and mixed numeric/categorical inputs
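The batch-scoring constraint translates into a per-application time budget, which helps when comparing candidate model families. A small sketch of the arithmetic follows; the helper `scoring_budget` and its `n_workers` parameter are hypothetical, and the budget is assumed to cover feature preparation as well as model inference.

```python
def scoring_budget(n_rows, time_limit_s, n_workers=1):
    """Return (required rows per second, time budget per row in ms per worker)."""
    rows_per_second = n_rows / time_limit_s
    ms_per_row = 1000.0 * time_limit_s * n_workers / n_rows
    return rows_per_second, ms_per_row


# 120K applications in 10 minutes on a single worker.
rate, per_row_ms = scoring_budget(120_000, 10 * 60)
print(f"{rate:.0f} applications/s, {per_row_ms:.1f} ms per application")
```

This works out to 200 applications per second, or 5 ms per application on one worker, so the latency constraint alone is unlikely to rule out any common tabular model family scored in vectorized batches; interpretability and weekly CPU retraining are likely the more binding constraints.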
Deliverables
- Select the most appropriate model family and justify why it fits these constraints better than at least one alternative.
- Build an end-to-end training pipeline with preprocessing, validation, and threshold selection.
- Report evaluation metrics relevant to imbalanced binary classification.
- Explain how you would handle missing data, categorical variables, and probability calibration.
- Describe how you would deploy and monitor the model in production.
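For the monitoring deliverable, one option often used in credit risk is the Population Stability Index (PSI), which compares the score distribution at training time with the distribution seen in production. The case description does not prescribe PSI; the sketch below is an assumption-laden illustration that uses equal-width bins, expects non-empty samples of scores in [0, 1], and uses the hypothetical function name `psi` and made-up sample data.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two non-empty samples of scores in [0, 1]."""

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Equal-width bins; clamp so that 1.0 falls into the last bin.
            idx = min(max(int(v * n_bins), 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = bin_shares(expected)
    a = bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Made-up score samples for illustration only.
baseline = [i / 100 for i in range(100)]
shifted = [min(v + 0.2, 0.99) for v in baseline]
stable_psi = psi(baseline, baseline)
drift_psi = psi(baseline, shifted)
```

Identical distributions give a PSI of zero, and larger values indicate a bigger shift. Alert thresholds are a judgment call; a commonly quoted rule of thumb treats values above roughly 0.25 as a shift worth investigating, but any cutoff should be agreed with the risk team. The same calculation can be applied to individual input features as well as to model scores.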