You own a gradient-boosted pricing model that generates technical premiums in an insurance pricing workflow; underwriters can then apply limited overrides in Akur8 Pricing. The model was trained on the last 24 months of policy and claims data, and any premium change of more than 12% in either direction relative to the incumbent tariff is automatically routed for manual review. Leadership wants to promote the model from a monitored pilot to default production use, but actuarial reviewers have flagged that, although aggregate fit looks good, some recent quotes appear systematically underpriced in a few segments.

| Metric | Validation Set | Last 30 Days Pilot |
|---|---|---|
| MAE (premium error) | 7.8% | 8.4% |
| RMSE (premium error) | 11.2% | 15.9% |
| Mean signed error | +0.4% | -3.6% |
| Policies within 7% of indicated premium | 72% | 64% |
| Policies within 7%: low-risk segment | 79% | 78% |
| Policies within 7%: high-risk segment | 68% | 49% |
| Loss ratio on quoted business | 98.7% | 106.8% |
| Share of quotes sent to manual review | 9.5% | 18.1% |
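
As a reading aid, the pilot-versus-validation shifts in the table can be tabulated with a short sketch. The numbers are copied from the table; the dictionary layout and the function names `shifts` and `rmse_to_mae` are illustrative assumptions, not part of any Akur8 tooling. The RMSE-to-MAE ratio is included because RMSE weights large errors more heavily than MAE, so a rising ratio is consistent with a few large misses rather than a uniform drift.

```python
# Values copied from the table above, in percent: (validation, pilot).
metrics = {
    "MAE": (7.8, 8.4),
    "RMSE": (11.2, 15.9),
    "Mean signed error": (0.4, -3.6),
    "Within 7%, all": (72.0, 64.0),
    "Within 7%, low-risk": (79.0, 78.0),
    "Within 7%, high-risk": (68.0, 49.0),
    "Loss ratio": (98.7, 106.8),
    "Manual review share": (9.5, 18.1),
}


def shifts(table):
    """Pilot minus validation, in percentage points, for each metric."""
    return {name: round(pilot - val, 1) for name, (val, pilot) in table.items()}


def rmse_to_mae(table):
    """RMSE/MAE ratio for validation and pilot; higher means heavier error tails."""
    val = table["RMSE"][0] / table["MAE"][0]
    pilot = table["RMSE"][1] / table["MAE"][1]
    return round(val, 2), round(pilot, 2)


deltas = shifts(metrics)
ratio_val, ratio_pilot = rmse_to_mae(metrics)

for name, delta in deltas.items():
    print(f"{name:<22} {delta:+.1f} pp")
print(f"RMSE/MAE ratio: validation {ratio_val}, pilot {ratio_pilot}")
```

On these figures the high-risk segment loses 19 points of within-7% accuracy while the low-risk segment loses 1, and the RMSE/MAE ratio moves from roughly 1.44 to 1.89, so the aggregate MAE change of 0.6 points understates where the deterioration sits.
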

How would you decide whether this pricing output is accurate enough for production use in Akur8 Pricing, and what would you investigate before recommending full rollout, limited rollout, or rollback?
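
One possible starting point for the investigation is to recompute the table's error metrics per segment from quote-level records, so that a segment-level bias is not hidden by the aggregate. The sketch below is a minimal illustration under stated assumptions: the record layout `(segment, model_premium, indicated_premium)`, the function name `segment_metrics`, and the toy data are all hypothetical, and the sign convention (negative means the model premium is below the indicated premium, i.e. underpriced) is an assumption that should be checked against how the table's mean signed error was defined.

```python
import math
from collections import defaultdict


def segment_metrics(quotes, tolerance=0.07):
    """Per-segment error metrics from (segment, model_premium, indicated_premium).

    Errors are relative: (model - indicated) / indicated.
    Assumed convention: a negative error means the quote is underpriced.
    """
    errors = defaultdict(list)
    for segment, model, indicated in quotes:
        errors[segment].append((model - indicated) / indicated)

    out = {}
    for segment, errs in errors.items():
        n = len(errs)
        out[segment] = {
            "n": n,
            "mae": sum(abs(e) for e in errs) / n,
            "rmse": math.sqrt(sum(e * e for e in errs) / n),
            "mean_signed": sum(errs) / n,
            # Share of quotes whose absolute relative error is within tolerance.
            "within_tol": sum(abs(e) <= tolerance for e in errs) / n,
        }
    return out


# Made-up toy records, for illustration only (not pilot data).
toy = [
    ("low-risk", 510.0, 500.0),
    ("low-risk", 490.0, 500.0),
    ("high-risk", 900.0, 1000.0),
    ("high-risk", 980.0, 1000.0),
]
result = segment_metrics(toy)
for segment, stats in result.items():
    print(segment, {k: round(v, 3) for k, v in stats.items()})
```

Running the same function on finer cuts (for example by time window, or by any rating factor the reviewers suspect) would show whether the underpricing is concentrated in a few segments or spread across the book, which bears directly on the choice between full rollout, limited rollout, and rollback.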