Business Context
Northstar Health underwrites small-business insurance policies and uses a generalized linear model to predict next-quarter claim cost per policy. The pricing team wants a model that remains stable in the presence of noisy broker-entered fields and many weakly predictive variables.
Dataset
You are given a tabular regression dataset built from 24 months of policy history.
| Feature Group | Count | Examples |
|---|---|---|
| Numerical policy attributes | 18 | payroll, employee_count, prior_claim_amount, premium |
| Categorical business descriptors | 9 | industry_code, state, broker_channel, coverage_tier |
| Engineered ratios | 7 | claim_rate_per_employee, premium_to_payroll |
| Sparse broker-entered flags | 6 | safety_program_reported, seasonal_business_flag |
- Size: 82K policy-quarter rows, 40 raw features before encoding
- Target: Continuous — total claim cost in the next quarter
- Data quality: Several broker-entered fields are noisy, some categorical levels are rare, and many features are weakly informative
- Missing data: 8% missing in broker-entered flags, 3% missing in financial fields
Success Criteria
A good solution should reduce overfitting relative to an unregularized linear baseline, achieve lower validation RMSE than that baseline, and give a clear, justified recommendation for when to prefer L1 or L2 regularization on this dataset.
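The overfitting criterion can be checked directly by comparing held-out RMSE of an unregularized fit against a regularized one. The sketch below uses synthetic data with few strong signals and many noise features (mimicking the "many weakly predictive variables" described above), not the Northstar dataset; the alpha value is an arbitrary placeholder:

```python
# Does Ridge beat unregularized OLS on held-out RMSE when most
# features are noise? Synthetic data, illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, p = 300, 120
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only 5 features carry signal
y = X @ beta + rng.normal(scale=5.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
ridge = Ridge(alpha=10.0).fit(X_tr, y_tr)

def rmse(model):
    return mean_squared_error(y_te, model.predict(X_te)) ** 0.5

print(f"OLS RMSE:   {rmse(ols):.2f}")
print(f"Ridge RMSE: {rmse(ridge):.2f}")
```

With 120 coefficients estimated from 150 training rows, OLS fits noise aggressively; the shrinkage from even a roughly chosen alpha cuts the variance of the estimates and lowers held-out RMSE.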
Constraints
- The pricing team requires coefficient-level interpretability
- Batch scoring must complete in under 5 minutes for 100K policies
- Retraining happens monthly, so the approach should be simple and robust
Deliverables
- Explain the mathematical difference between L1 and L2 regularization in linear models.
- Train and compare Lasso and Ridge regression on the dataset.
- Show how regularization affects coefficient magnitude and sparsity.
- Use cross-validation to select the regularization strength.
- Recommend which method to deploy for a highly noisy dataset and justify the tradeoff.
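The core mathematical difference the first deliverable asks for: in scikit-learn's parameterization, Lasso minimizes ||y − Xβ||² / (2n) + α·||β||₁, whose corners at zero can drive coefficients exactly to zero, while Ridge minimizes ||y − Xβ||² + α·||β||₂², which shrinks coefficients smoothly but never zeroes them. A sketch of the comparison with cross-validated regularization strength follows; the data is synthetic with a sparse true signal, and the alpha grids are illustrative choices, not tuned values:

```python
# Lasso vs. Ridge with cross-validated alpha on a sparse-signal problem.
# Lasso (L1) zeroes out weak features; Ridge (L2) only shrinks them.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(7)
n, p = 500, 40
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:6] = [4, -3, 2, 2, -1, 1]          # few strong signals, many noise features
y = X @ beta + rng.normal(scale=3.0, size=n)

# LassoCV picks alpha by 5-fold CV; RidgeCV uses efficient leave-one-out CV
lasso = LassoCV(alphas=np.logspace(-3, 1, 30), cv=5).fit(X, y)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 30)).fit(X, y)

print("Lasso alpha chosen:", lasso.alpha_)
print("Lasso nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
print("Ridge nonzero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-8)))
```

The nonzero counts make the sparsity deliverable concrete: Lasso returns a model with a subset of features active, which suits the coefficient-level interpretability constraint and the many weakly informative variables, while Ridge keeps every feature with small weights and tends to be more stable when correlated features share signal. That tradeoff is the basis for the final recommendation.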