At NovaLearn, an LLM-based tutoring assistant answers high school and first-year college physics questions. The team is comparing two candidate response-generation models on prompts about gyroscopic precession, where factual correctness matters more than stylistic fluency.
Both models were evaluated on a held-out set of 500 gyroscopic precession prompts, with each response scored by physics subject-matter experts (SMEs). A response is marked correct only if it explains both the direction and the cause of the precession without introducing a major physics error.
| Metric | Model A | Model B |
|---|---|---|
| Accuracy | 0.78 | 0.74 |
| Precision | 0.84 | 0.69 |
| Recall | 0.71 | 0.88 |
| F1 Score | 0.77 | 0.77 |
| Major factual error rate | 0.11 | 0.19 |
| Calibration error | 0.07 | 0.15 |
| Avg. response length (tokens) | 142 | 214 |
The product team asks: which model is better, and why? The difficulty is that the two models have the same F1 score (0.77) but very different error profiles. Model A is more conservative and more precise, with a major factual error rate of 0.11 against Model B's 0.19. Model B has the higher recall (0.88 vs. 0.71) and is more likely to provide a usable answer, but it also hallucinates more often.
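To see why F1 alone cannot separate the two models, it can help to recompute it from the reported precision and recall and to translate the error rates into rough counts. The sketch below is illustrative only: the `f1_score` helper and the `models` dictionary are names introduced here, and it assumes the major factual error rate is a fraction of all 500 prompts, which the table does not state explicitly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


N_PROMPTS = 500  # size of the held-out set

# Values copied from the metrics table.
models = {
    "Model A": {"precision": 0.84, "recall": 0.71, "major_error_rate": 0.11},
    "Model B": {"precision": 0.69, "recall": 0.88, "major_error_rate": 0.19},
}

summary = {}
for name, m in models.items():
    f1 = f1_score(m["precision"], m["recall"])
    # Assumption: the error rate is measured over all prompts in the set.
    major_errors = round(m["major_error_rate"] * N_PROMPTS)
    summary[name] = {"f1": round(f1, 2), "major_errors": major_errors}
    print(f"{name}: F1 = {summary[name]['f1']}, "
          f"about {major_errors} major errors in {N_PROMPTS} prompts")
```

Both models round to an F1 of 0.77, yet under the stated assumption Model B would produce roughly 95 responses with a major factual error against roughly 55 for Model A. F1 weights precision and recall symmetrically, so it hides exactly the asymmetry that matters when factual correctness is the priority.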