Compare XGBoost and LightGBM for Claims Risk

Business Context

SureFast Insurance wants a tabular ML model to predict whether an auto insurance claim will exceed $10,000 so adjusters can prioritize high-risk cases. The team currently uses logistic regression and wants to understand gradient boosting in practice, including when XGBoost or LightGBM is the better choice.

Dataset

You are given a historical claims dataset with mixed numerical and categorical features collected at claim creation time.

Feature Group	Count	Examples
Claim details	12	claim_amount_initial, accident_type, vehicle_age, repair_shop_flag
Customer profile	9	policy_tenure_months, prior_claims_12m, credit_band, region
Policy info	8	coverage_type, deductible, premium, add_on_count
Temporal features	5	claim_month, day_of_week, holiday_flag, days_since_policy_start
External signals	4	weather_severity, traffic_index, fraud_risk_score, repair_cost_index

Size: 240K claims, 38 features
Target: Binary — high-cost claim over $10,000 (1) vs not (0)
Class balance: 14% positive, 86% negative
Missing data: 10% missing in external signals, 6% missing in customer credit fields, plus sparse and previously unseen category levels in region and repair_shop_id
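The size and class-balance figures above imply some quick arithmetic that is useful before any modeling. The sketch below is only an illustration: the variable names are hypothetical, and it assumes the 14% positive rate applies to the full 240K rows.

```python
# Back-of-the-envelope class-balance arithmetic from the stated dataset figures.
n_claims = 240_000
positive_rate = 0.14

n_pos = round(n_claims * positive_rate)   # high-cost claims (label 1)
n_neg = n_claims - n_pos                  # all other claims (label 0)

# Negative-to-positive ratio: a common starting value for class weighting
# in boosted-tree libraries (an assumption to be tuned, not a fixed rule).
neg_to_pos_ratio = n_neg / n_pos

# A model that ranks claims at random has an expected PR-AUC close to the
# positive rate, so this is the floor a PR-AUC result should be read against.
random_pr_auc_baseline = positive_rate

print(n_pos, n_neg, round(neg_to_pos_ratio, 2), random_pr_auc_baseline)
```

With roughly six negatives per positive, accuracy alone would be a misleading metric here, which is why PR-AUC is a sensible companion to AUC-ROC.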

Success Criteria

A good solution should outperform logistic regression and deliver a model suitable for weekly batch scoring. Target AUC-ROC above 0.87 and PR-AUC above 0.52, while keeping inference fast enough to score 250K claims in under 10 minutes.
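To make these targets concrete, here is a minimal pure-Python sketch of the two ranking metrics and the throughput requirement. The function names (`roc_auc`, `average_precision`, `meets_targets`) are hypothetical helpers written for this illustration, and average precision is used as one common way to estimate PR-AUC. In practice a library implementation would normally be used, and the quadratic pairwise AUC below is only suitable for tiny examples.

```python
def roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of precision values taken at the rank of each positive example."""
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


def meets_targets(auc, pr_auc, auc_target=0.87, pr_auc_target=0.52):
    """Check both quality thresholds from the success criteria."""
    return auc > auc_target and pr_auc > pr_auc_target


# Throughput implied by "250K claims in under 10 minutes".
required_rows_per_second = 250_000 / (10 * 60)

# Toy labels and scores, purely illustrative.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

The throughput target works out to roughly 417 claims per second, which is a useful number to time candidate models against.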

Constraints

Mixed feature types with moderate missingness
Need feature importance for model review and regulator-facing documentation
Batch inference only, but retraining should fit within a 2-hour weekly job
Candidate should explain gradient boosting clearly, then compare XGBoost vs LightGBM tradeoffs
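Since a clear explanation of gradient boosting is expected, a toy version can help anchor it. The sketch below is not how XGBoost or LightGBM are implemented; it is a deliberately simplified illustration that assumes a single numeric feature, depth-1 trees (stumps), log loss, and made-up data. Each round fits a stump to the negative gradient of the loss, which for log loss is the label minus the predicted probability, and adds a shrunken copy of it to the running score.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, targets):
    """Pick the threshold whose two leaf means minimize squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right leaf is non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right)
        sse = (sum((r - l_mean) ** 2 for r in left)
               + sum((r - r_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    return best[1:]


def fit_boosting(xs, ys, n_rounds=30, lr=0.3):
    """Additive model in log-odds space, built one stump per round."""
    p0 = sum(ys) / len(ys)
    base = math.log(p0 / (1.0 - p0))          # start from the prior log-odds
    scores = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of log loss with respect to the score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, l_val, r_val = fit_stump(xs, residuals)
        stumps.append((t, l_val, r_val))
        scores = [s + lr * (l_val if x <= t else r_val)
                  for x, s in zip(xs, scores)]
    return {"base": base, "lr": lr, "stumps": stumps}


def predict_proba(model, x):
    score = model["base"] + sum(
        model["lr"] * (l_val if x <= t else r_val)
        for t, l_val, r_val in model["stumps"])
    return sigmoid(score)


def log_loss(model, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict_proba(model, x)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)


# Made-up toy data: one feature, labels that mostly increase with it.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 0, 1, 0, 1, 1, 1]
model = fit_boosting(toy_x, toy_y)
```

Production libraries add second-order gradient information, regularization, histogram-based split finding, and multi-feature trees on top of this basic loop.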

Deliverables

Explain gradient boosting and how XGBoost and LightGBM implement it.
Build and compare an XGBoost model and a LightGBM model on this dataset.
Describe preprocessing, feature engineering, and validation strategy.
Report evaluation metrics and recommend one model for production.
Summarize practical tradeoffs: accuracy, speed, interpretability, and operational complexity.
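As a starting point for the build-and-compare deliverable, the two libraries can be given roughly matched configurations. The parameter names below are real XGBoost and LightGBM parameters, but every value is an untuned assumption chosen for illustration, and the class weight simply reuses the stated 86/14 split. The training calls are left as comments because the data objects (`dtrain`, `train_set`) are hypothetical.

```python
# Negative-to-positive ratio from the stated 14% / 86% class balance.
SCALE_POS_WEIGHT = 0.86 / 0.14

# XGBoost: grows trees depth-wise by default, so max_depth is the main
# complexity control. Missing values are routed by a learned default direction.
xgb_params = {
    "objective": "binary:logistic",
    "eval_metric": ["auc", "aucpr"],
    "tree_method": "hist",
    "max_depth": 6,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "scale_pos_weight": SCALE_POS_WEIGHT,
}

# LightGBM: grows trees leaf-wise, so num_leaves is the main complexity
# control. It also handles missing values and categorical columns natively.
lgb_params = {
    "objective": "binary",
    "metric": ["auc", "average_precision"],
    "num_leaves": 63,
    "learning_rate": 0.05,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "feature_fraction": 0.8,
    "scale_pos_weight": SCALE_POS_WEIGHT,
}

# Hypothetical usage, assuming the libraries are installed and the data
# objects have been built from the claims table:
#   import xgboost as xgb
#   booster = xgb.train(xgb_params, dtrain, num_boost_round=500)
#   import lightgbm as lgb
#   booster = lgb.train(lgb_params, train_set, num_boost_round=500)
```

Keeping the learning rate, row sampling, and column sampling aligned across the two dictionaries makes the comparison closer to like-for-like, so that differences in accuracy and training time are more attributable to the libraries than to the settings.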
