An OpenAI research team is reviewing a new GPT-4.1-based assistant variant for customer-facing deployment in ChatGPT. The model was updated with additional instruction tuning and policy-focused preference optimization to improve refusal behavior and reduce unsafe outputs, but product teams report that task completion may have regressed.
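The scenario does not specify which preference-optimization method was applied; as one illustration, here is a minimal PyTorch sketch of a DPO-style objective, a common choice for policy-focused preference tuning. The tensor names (`policy_chosen_logps`, etc.) are hypothetical inputs: summed log-probabilities of the preferred and rejected responses under the policy being tuned and under a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss over paired (chosen, rejected) responses.

    Each argument is a 1-D tensor of summed token log-probs for a batch of
    preference pairs; beta controls how far the policy may drift from the
    reference model.
    """
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response more strongly than the
    # reference model does; the log-sigmoid keeps gradients bounded.
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()
```

Tuning hard on an objective like this is what can drive the trade-offs visible in the evaluation results below.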
| Metric | Baseline Model | New Aligned Model | Change (new - baseline) |
|---|---|---|---|
| Helpfulness win rate (human eval) | 71% | 66% | -5 pts |
| Safety violation rate | 2.8% | 0.9% | -1.9 pts |
| Over-refusal rate on benign prompts | 6% | 18% | +12 pts |
| Factual accuracy on eval set | 84% | 81% | -3 pts |
| Calibration error | 0.07 | 0.11 | +0.04 |
| Task completion rate | 88% | 79% | -9 pts |
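Reading the table requires tracking each metric's preferred direction, since a negative change is good for some rows and bad for others. A minimal sketch of that bookkeeping, where the values are transcribed from the table and the `higher_is_better` flags are our own encoding of each metric's direction:

```python
# Each entry: (baseline, new, higher_is_better). Values come from the table
# above; the direction flags are our reading of each metric.
METRICS = {
    "helpfulness_win_rate":  (0.71, 0.66, True),
    "safety_violation_rate": (0.028, 0.009, False),
    "over_refusal_rate":     (0.06, 0.18, False),
    "factual_accuracy":      (0.84, 0.81, True),
    "calibration_error":     (0.07, 0.11, False),
    "task_completion_rate":  (0.88, 0.79, True),
}

for name, (base, new, higher_is_better) in METRICS.items():
    delta = new - base
    improved = (delta > 0) == higher_is_better
    print(f"{name:22s} {delta:+.3f}  {'improved' if improved else 'regressed'}")
```

Run as-is, this flags `safety_violation_rate` as the only improvement; every other metric regressed, which is the pattern the task below asks you to explain.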
The team wants to determine whether these results indicate better alignment, worse model quality, or both. Your task is to explain how model alignment differs from model evaluation, and to use the metrics above to diagnose what happened in this release.
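Of the metrics above, calibration error is the least self-explanatory. The table does not say which calibration metric was used; a common choice is expected calibration error (ECE), sketched below under the assumption that per-example model confidences and correctness labels are available.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    per bin, weighted by the fraction of examples in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between how accurate the model is and how confident it claims
            # to be, within this confidence bin.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

On a measure like this, a rise from 0.07 to 0.11 means the new model's expressed confidence tracks its actual accuracy less well than before.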