SureFast Insurance wants a tabular ML model to predict whether an auto insurance claim will exceed $10,000 so adjusters can prioritize high-risk cases. The team currently uses logistic regression and wants to understand gradient boosting in practice, including when XGBoost or LightGBM is the better choice.
You are given a historical claims dataset with 38 features, a mix of numerical and categorical, all collected at claim creation time and grouped as follows:
| Feature Group | Count | Examples |
|---|---|---|
| Claim details | 12 | claim_amount_initial, accident_type, vehicle_age, repair_shop_flag |
| Customer profile | 9 | policy_tenure_months, prior_claims_12m, credit_band, region |
| Policy info | 8 | coverage_type, deductible, premium, add_on_count |
| Temporal features | 5 | claim_month, day_of_week, holiday_flag, days_since_policy_start |
| External signals | 4 | weather_severity, traffic_index, fraud_risk_score, repair_cost_index |
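
The feature table can be captured as a small schema, which makes it easy to confirm the total column count and to separate columns that will likely need categorical handling. This is a minimal sketch: the group counts and example names come from the table, while `ASSUMED_CATEGORICAL` and `split_examples` are hypothetical, and the categorical labels are guesses from the column names that should be checked against the real schema.

```python
# Feature groups as listed in the table: (column count, example columns).
FEATURE_GROUPS = {
    "claim_details": (12, ["claim_amount_initial", "accident_type",
                           "vehicle_age", "repair_shop_flag"]),
    "customer_profile": (9, ["policy_tenure_months", "prior_claims_12m",
                             "credit_band", "region"]),
    "policy_info": (8, ["coverage_type", "deductible", "premium",
                        "add_on_count"]),
    "temporal_features": (5, ["claim_month", "day_of_week", "holiday_flag",
                              "days_since_policy_start"]),
    "external_signals": (4, ["weather_severity", "traffic_index",
                             "fraud_risk_score", "repair_cost_index"]),
}

# Total number of input columns across all groups.
total_features = sum(count for count, _ in FEATURE_GROUPS.values())

# ASSUMPTION: these example columns look categorical by name only.
ASSUMED_CATEGORICAL = {"accident_type", "credit_band", "region", "coverage_type"}


def split_examples(groups, categorical):
    """Split the named example columns into assumed-categorical and the rest."""
    named = [col for _, cols in groups.values() for col in cols]
    cat = [col for col in named if col in categorical]
    other = [col for col in named if col not in categorical]
    return cat, other


cat_cols, other_cols = split_examples(FEATURE_GROUPS, ASSUMED_CATEGORICAL)
print(total_features, len(cat_cols), len(other_cols))
```

Only 20 of the 38 columns are named in the table, so the split covers the listed examples rather than the full dataset.
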
A good solution should outperform the logistic regression baseline and deliver a model suitable for weekly batch scoring. The targets are an AUC-ROC above 0.87 and a PR-AUC above 0.52, with inference fast enough to score 250K claims in under 10 minutes (roughly 417 claims per second).
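
The acceptance criteria can be written down as a small check, which helps keep the metric and latency targets explicit during model comparison. The sketch below uses only the standard library. The thresholds come from the problem statement, while `Targets`, `roc_auc`, `average_precision`, and `meets_targets` are hypothetical helper names. It also assumes PR-AUC is estimated by average precision, a common choice but not one the problem statement specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    # Thresholds taken from the problem statement.
    min_roc_auc: float = 0.87
    min_pr_auc: float = 0.52
    batch_size: int = 250_000
    max_seconds: float = 10 * 60


def required_throughput(t: Targets) -> float:
    """Minimum claims per second needed to finish the batch in time."""
    return t.batch_size / t.max_seconds


def roc_auc(y_true, scores):
    """AUC-ROC via the rank-sum formulation, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(y_true, scores):
    """Mean precision at each positive, ranked by descending score.

    ASSUMPTION: used here as the PR-AUC estimate; tied scores are not
    grouped, so results can differ slightly from library implementations.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            total += hits / rank
    return total / hits


def meets_targets(y_true, scores, seconds_for_batch, t=Targets()):
    """True only if both metric targets and the batch time limit are met."""
    return (roc_auc(y_true, scores) > t.min_roc_auc
            and average_precision(y_true, scores) > t.min_pr_auc
            and seconds_for_batch < t.max_seconds)


# Toy labels and scores, purely illustrative (not real claims data).
y = [0, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8]
print(round(required_throughput(Targets()), 1), roc_auc(y, s))
```

In practice the labels and scores would come from a held-out validation split, and `seconds_for_batch` from timing the model on a full 250K-row batch. Because high-value claims are likely a minority class, the PR-AUC target is worth tracking alongside AUC-ROC rather than relying on either alone.
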