Business Context
NorthStar Lending wants a production-ready model to predict whether a personal loan applicant will default within 12 months. The current rule-based system rejects too many creditworthy applicants, and the underwriting team is concerned that missing values in application and bureau fields could bias a model's predictions.
Dataset
You are given a historical loan dataset collected from online applications and third-party credit bureau pulls.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 6 | age, employment_status, residence_type |
| Financial variables | 10 | annual_income, debt_to_income, monthly_obligations |
| Credit bureau fields | 8 | credit_score, delinquencies_12m, utilization_rate |
| Application metadata | 6 | channel, device_type, application_hour |
| Target | 1 | defaulted_12m |
- Size: 120,000 loan applications, 30 input features
- Target: Binary; default within 12 months (1) vs. no default (0)
- Class balance: 18% default, 82% non-default
- Missing data: 22% missing in bureau variables, 9% missing in income-related fields, 3% missing in categorical application fields
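A first look at the missing data could be sketched as follows. This is a minimal, standard-library illustration, not a prescribed method: the helper `missingness_profile` and the four sample rows are hypothetical, and only the column names come from the table above. For each feature it reports the missing rate and the default rate among rows where the value is missing versus observed, which gives an early hint of whether missingness itself carries signal.

```python
def missingness_profile(rows, features, target="defaulted_12m"):
    """Per feature: missing rate and default rate when missing vs. observed."""
    profile = {}
    for feat in features:
        # Split the target values by whether the feature is missing (None).
        missing = [r[target] for r in rows if r.get(feat) is None]
        observed = [r[target] for r in rows if r.get(feat) is not None]
        profile[feat] = {
            "missing_rate": len(missing) / len(rows),
            "default_rate_missing": sum(missing) / len(missing) if missing else None,
            "default_rate_observed": sum(observed) / len(observed) if observed else None,
        }
    return profile


# Hand-made sample rows (illustrative values only, not taken from the real dataset).
sample = [
    {"credit_score": 710, "annual_income": 52000, "defaulted_12m": 0},
    {"credit_score": None, "annual_income": 38000, "defaulted_12m": 1},
    {"credit_score": 655, "annual_income": None, "defaulted_12m": 0},
    {"credit_score": None, "annual_income": 41000, "defaulted_12m": 1},
]

profile = missingness_profile(sample, ["credit_score", "annual_income"])
for feat, stats in profile.items():
    print(feat, stats)
```

A large gap between the two default rates for a feature suggests that the data are not missing completely at random, which is worth confirming on the full dataset with a proper statistical test before drawing conclusions.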
Success Criteria
A good solution should improve over a complete-case baseline while handling missingness without leakage. Target performance is ROC-AUC >= 0.80 and PR-AUC >= 0.45, with stable validation performance across folds.
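To make the two target metrics concrete, here is a small standard-library sketch. It assumes PR-AUC is estimated with average precision (one common estimator; other interpolation schemes give slightly different numbers), and the helper names and the four-row toy example are hypothetical. Note that a no-skill classifier is expected to score a PR-AUC close to the positive rate, so with 18% defaults the 0.45 target is well above chance.

```python
from statistics import mean, pstdev

ROC_AUC_TARGET = 0.80  # thresholds from the success criteria
PR_AUC_TARGET = 0.45


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Sum of precision at each distinct threshold, weighted by the recall gained there."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    tp = fp = 0
    ap = 0.0
    i = 0
    while i < len(ranked):
        j, tp_before = i, tp
        # Rows with tied scores share one threshold, so process them together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        ap += (tp / (tp + fp)) * (tp - tp_before) / n_pos
        i = j
    return ap


def summarize_folds(fold_scores):
    """fold_scores: list of (roc_auc, pr_auc) pairs, one per validation fold."""
    rocs = [r for r, _ in fold_scores]
    prs = [p for _, p in fold_scores]
    return {
        "roc_auc_mean": mean(rocs), "roc_auc_std": pstdev(rocs),
        "pr_auc_mean": mean(prs), "pr_auc_std": pstdev(prs),
        "meets_targets": mean(rocs) >= ROC_AUC_TARGET and mean(prs) >= PR_AUC_TARGET,
    }


# Toy example: four applicants, two of whom defaulted.
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
roc = roc_auc(y_toy, s_toy)
ap = average_precision(y_toy, s_toy)
summary = summarize_folds([(roc, ap)])
print(round(roc, 4), round(ap, 4), summary["meets_targets"])
```

The per-fold standard deviations returned by `summarize_folds` are one simple way to quantify "stable across folds"; the acceptable spread is not specified in the brief and would need to be agreed with the underwriting team.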
Constraints
- The underwriting team needs a model that is reasonably interpretable
- Inference must complete in under 50 ms per application in the online approval flow
- The solution must support weekly retraining and robust behavior when missingness rates shift
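The last constraint implies some check on whether missingness rates have moved between the training data and a new weekly batch. One possible sketch, using only the standard library, is a two-proportion z-test combined with a minimum absolute change so that large samples do not flag trivial differences. The function names, the tolerance values (`min_abs_delta=0.05`, `z_crit=3.0`), and the weekly batch counts are all assumptions for illustration; the reference figures treat the 22% bureau missing rate as a single rate over 120,000 rows.

```python
import math


def missing_rate_shift(ref_missing, ref_total, cur_missing, cur_total):
    """Return (change in missing rate, two-proportion z statistic)."""
    p_ref = ref_missing / ref_total
    p_cur = cur_missing / cur_total
    pooled = (ref_missing + cur_missing) / (ref_total + cur_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ref_total + 1 / cur_total))
    z = (p_cur - p_ref) / se if se > 0 else 0.0
    return p_cur - p_ref, z


def is_shifted(delta, z, min_abs_delta=0.05, z_crit=3.0):
    """Flag only shifts that are both practically and statistically notable.

    Both thresholds are hypothetical defaults, not values from the brief.
    """
    return abs(delta) >= min_abs_delta and abs(z) >= z_crit


# Reference: 22% of 120,000 training rows missing a bureau field.
# Current: a hypothetical weekly batch of 2,000 applications with 620 missing.
delta, z = missing_rate_shift(26_400, 120_000, 620, 2_000)
alert = is_shifted(delta, z)

# A second hypothetical batch with a small change (450 of 2,000 missing).
delta_small, z_small = missing_rate_shift(26_400, 120_000, 450, 2_000)
alert_small = is_shifted(delta_small, z_small)

print(f"delta={delta:.3f} z={z:.1f} alert={alert}")
print(f"delta={delta_small:.3f} z={z_small:.1f} alert={alert_small}")
```

In practice this check would run per feature (or per feature group) on each weekly batch, and an alert could trigger a review before the retrained model is promoted.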
Deliverables
- Build a classification pipeline that handles missing numerical and categorical data correctly
- Compare at least two missing-data strategies, including row deletion (the complete-case baseline) and imputation-based modeling
- Explain how you would detect whether missingness itself is predictive and whether to add missing-indicator features
- Evaluate the final model with appropriate validation and business-relevant metrics
- Describe how you would monitor missingness patterns and model drift in production
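The first few deliverables can be sketched with scikit-learn as below. This is one possible setup under stated assumptions, not a reference solution: the data are synthetic (three columns named after the table's examples, with missingness deliberately made more likely for defaulters), logistic regression is chosen only because it is easy to interpret, and `build_pipeline` is a hypothetical helper. The imputers live inside the pipeline, so each cross-validation split fits them on the training folds only, which avoids leakage from validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["credit_score", "annual_income"]
CATEGORICAL = ["employment_status"]


def build_pipeline(add_indicator):
    """Median imputation (optionally with missing-indicator columns) + logistic regression."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=add_indicator)),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        # Treat a missing category as its own level rather than guessing a value.
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    pre = ColumnTransformer([("num", numeric, NUMERIC), ("cat", categorical, CATEGORICAL)])
    return Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])


# Synthetic stand-in for the loan data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "credit_score": rng.normal(680, 50, n),
    "annual_income": rng.lognormal(10.8, 0.4, n),
    "employment_status": rng.choice(["employed", "self_employed", "unemployed"], n),
})
# Default probability falls as credit score rises; base rate is roughly 18%.
p_default = 1 / (1 + np.exp(1.5 + 0.01 * (X["credit_score"] - 680)))
y = pd.Series((rng.random(n) < p_default).astype(int), name="defaulted_12m")

# Inject missingness; credit_score is made more likely to be missing for defaulters.
X.loc[rng.random(n) < np.where(y == 1, 0.40, 0.18), "credit_score"] = np.nan
X.loc[rng.random(n) < 0.09, "annual_income"] = np.nan
X.loc[rng.random(n) < 0.03, "employment_status"] = np.nan

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["roc_auc", "average_precision"]

# Strategy 1: imputation with missing indicators, using every row.
imputed = cross_validate(build_pipeline(True), X, y, cv=cv, scoring=scoring)

# Strategy 2: complete-case baseline, dropping any row with a missing value.
complete = X.notna().all(axis=1)
complete_case = cross_validate(
    build_pipeline(False), X[complete], y[complete], cv=cv, scoring=scoring
)

for name, res in [("imputed + indicators", imputed), ("complete-case", complete_case)]:
    print(name,
          "ROC-AUC %.3f" % res["test_roc_auc"].mean(),
          "PR-AUC %.3f" % res["test_average_precision"].mean())
```

The two strategies here are scored on different populations, since the complete-case model never sees rows with missing fields, so the numbers are not directly comparable; a fairer comparison would evaluate both on the same held-out applicants. Refitting with `add_indicator=False` on the full data is a simple way to measure how much the indicator columns themselves contribute.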