FinEdge, a digital lending platform processing about 250K consumer loan applications per month, wants to improve its default-risk model. The credit team currently uses logistic regression, but they want a stronger baseline and a clear explanation of when XGBoost is the right choice for tabular classification problems.
You are given a historical loan application dataset and asked to build a binary classifier that predicts whether an approved loan will default within 12 months.
| Feature Group | Count | Examples |
|---|---|---|
| Applicant financials | 12 | annual_income, debt_to_income, revolving_utilization, credit_score |
| Loan attributes | 8 | loan_amount, interest_rate, term_months, purpose |
| Credit history | 10 | delinquencies_2y, inquiries_6m, oldest_trade_age, public_records |
| Application metadata | 6 | channel, state, employment_length, verification_status |
**Target:** `default_12m` — 1 if the borrower defaults within 12 months, else 0.

A good solution should outperform logistic regression and random forest on recall and PR-AUC while remaining practical for batch scoring. A strong answer should also explain what XGBoost is, why it works well on structured data, and its tradeoffs versus simpler models.