Business Context
NorthStar Bank reviews roughly 120,000 personal loan applications per month and needs a transparent baseline model for default risk screening. The credit risk team wants a logistic regression model they can explain to auditors, product managers, and operations analysts.
Dataset
You are given an application-level dataset for binary classification: whether a borrower defaults within 12 months of origination.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant demographics | 6 | age, employment_status, residential_status |
| Credit bureau variables | 10 | fico_score, delinquencies_12m, credit_utilization |
| Loan attributes | 7 | loan_amount, term_months, interest_rate, purpose |
| Banking behavior | 5 | avg_balance_90d, overdraft_count_6m, direct_deposit_flag |
| Derived ratios | 4 | debt_to_income, loan_to_income, utilization_trend |
- Size: 480,000 loan applications, 32 features
- Target: Binary label indicating default within 12 months
- Class balance: 11.4% default, 88.6% non-default
- Missing data: 9% missing in banking features, 3% missing in bureau variables, sparse missingness in employment fields
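The missingness and class imbalance described above could be handled along the lines of the sketch below. It is not a reference solution: the data is synthetic, the relationship between features and default is made up, and the choices of median imputation with missing-value indicators, a separate "missing" category, and `class_weight="balanced"` are assumptions rather than requirements from the bank. Only the column names are taken from the feature table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 2_000

# Synthetic stand-in for the real applications (assumption, not bank data).
X = pd.DataFrame({
    "fico_score": rng.normal(690, 50, n),
    "credit_utilization": rng.uniform(0, 1, n),
    "avg_balance_90d": rng.lognormal(7, 1, n),
    "employment_status": rng.choice(["employed", "self_employed", "unemployed"], n),
})

# Made-up signal so the labels are imbalanced but not pure noise.
logit = -2.2 - 0.012 * (X["fico_score"] - 690) + 1.5 * (X["credit_utilization"] - 0.5)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

# Inject missingness at roughly the rates quoted for banking and bureau fields.
X.loc[rng.uniform(size=n) < 0.09, "avg_balance_90d"] = np.nan
X.loc[rng.uniform(size=n) < 0.03, "fico_score"] = np.nan
X.loc[rng.uniform(size=n) < 0.01, "employment_status"] = np.nan

numeric = ["fico_score", "credit_utilization", "avg_balance_90d"]
categorical = ["employment_status"]

preprocess = ColumnTransformer([
    # Median imputation plus indicator columns, so "was missing" can carry signal.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median", add_indicator=True)),
        ("scale", StandardScaler()),
    ]), numeric),
    # Missing categories become their own level; unseen levels are ignored at scoring time.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model.fit(X_train, y_train)
default_prob = model.predict_proba(X_test)[:, 1]
```

Keeping imputation, encoding, and scaling inside one `Pipeline` means the same fitted transformations are applied at training and scoring time. One caveat: class reweighting generally shifts predicted probabilities away from observed default rates, so if calibration matters it may be preferable to fit without weights or to recalibrate afterwards.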
Success Criteria
A strong solution should produce a well-calibrated logistic regression model with ROC-AUC ≥ 0.78, PR-AUC ≥ 0.38, and clear coefficient-based explanations for the top risk drivers. The model should also support threshold tuning for different approval policies.
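To make the metrics above concrete, the following sketch computes them by hand on ten invented applications. The labels, scores, and thresholds are illustrative assumptions, and the helper names (`roc_auc`, `average_precision`, `brier_score`, `threshold_report`) are hypothetical. Average precision is used here as one common estimate of PR-AUC, and the Brier score as one simple calibration check. In practice a library such as scikit-learn would normally be used instead of these hand-rolled versions.

```python
def roc_auc(y_true, scores):
    """Chance that a random defaulter is scored above a random non-defaulter (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision@k taken at each defaulter, ranking by descending score.

    Tied scores are not treated specially in this simplified version.
    """
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


def brier_score(y_true, scores):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((s - y) ** 2 for y, s in zip(y_true, scores)) / len(y_true)


def threshold_report(y_true, scores, threshold):
    """Threshold-based view: applications scoring at or above the threshold are declined."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y == 1 for f, y in zip(flagged, y_true))
    fp = sum(f and y == 0 for f, y in zip(flagged, y_true))
    fn = sum((not f) and y == 1 for f, y in zip(flagged, y_true))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "approval_rate": 1 - sum(flagged) / len(scores),
    }


# Invented example: 1 = default within 12 months, scores = predicted default probability.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.05, 0.08, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60, 0.80]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
brier = brier_score(y_true, scores)
strict_policy = threshold_report(y_true, scores, threshold=0.25)
lenient_policy = threshold_report(y_true, scores, threshold=0.50)
```

On this toy data the stricter policy (threshold 0.25) catches every defaulter but approves fewer applicants, while the lenient policy (threshold 0.50) approves more and misses one defaulter. That trade-off is what threshold tuning for different approval policies has to balance.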
Constraints
- The model must be interpretable enough for regulatory review
- Batch scoring should complete in under 5 minutes for 120,000 applications
- Feature transformations must be reproducible in production
- The bank prefers a stable baseline over a complex black-box model
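The reproducibility and batch-scoring constraints above can be checked with a sketch like the one below. It assumes the whole pipeline is persisted as a single artifact with `joblib` and then reloaded for scoring. The data is random noise with 32 numeric columns, the file name `default_model.joblib` is hypothetical, and the timing only measures in-memory scoring, not data loading or feature extraction.

```python
import tempfile
import time
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Random placeholder data: 32 columns to mirror the feature count, ~11.4% positives.
X_train = rng.normal(size=(5_000, 32))
y_train = (rng.uniform(size=5_000) < 0.114).astype(int)

pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

# Persist preprocessing and model together so production applies identical transformations.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "default_model.joblib"  # hypothetical artifact name
    joblib.dump(pipeline, artifact)
    loaded = joblib.load(artifact)

# Score one month's worth of applications and time it.
X_batch = rng.normal(size=(120_000, 32))
start = time.perf_counter()
batch_scores = loaded.predict_proba(X_batch)[:, 1]
elapsed_seconds = time.perf_counter() - start
```

Comparing `elapsed_seconds` against the five-minute budget gives a quick feasibility check. A real measurement would also need to include reading the applications and computing derived features.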
Deliverables
- Explain logistic regression in practical terms, including the sigmoid function, log-odds, coefficients, and why it is used for binary classification.
- Build a production-ready logistic regression pipeline for default prediction.
- Handle missing values, categorical variables, scaling, and class imbalance appropriately.
- Evaluate the model using threshold-free and threshold-based metrics.
- Interpret the learned coefficients and discuss when logistic regression is preferable to more complex models.
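As a reference point for the first and last deliverables, the standard relationships between the sigmoid, log-odds, and coefficients are summarized below. The symbols ($p$ for the default probability, $x_j$ for features, $\beta_j$ for coefficients) are notation introduced here, and the coefficient value in the worked example is hypothetical.

$$
p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \sum_{j} \beta_j x_j
$$

$$
\log\frac{p}{1 - p} = z
\quad\Longrightarrow\quad
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}
$$

For example, if a standardized credit_utilization feature had a coefficient of $\beta_j = 0.40$ (an invented value), then $e^{0.40} \approx 1.49$: each one-standard-deviation increase in utilization would multiply the odds of default by about 1.49, holding the other features fixed. The effect is multiplicative on the odds, not additive on the probability.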