Evaluate Two Physics QA Responses

Context

At NovaLearn, an LLM-based tutoring assistant answers high school and first-year college physics questions. The team is comparing two candidate response-generation models on prompts about gyroscopic precession, where factual correctness matters more than stylistic fluency.

Current Performance

Both models were evaluated on a held-out set of 500 gyroscopic precession prompts, scored by physics subject-matter experts (SMEs). A response is marked correct only if it explains both the direction and the cause of precession without introducing a major physics error.

Metric	Model A	Model B
Accuracy	0.78	0.74
Precision	0.84	0.69
Recall	0.71	0.88
F1 Score	0.77	0.77
Major factual error rate	0.11	0.19
Calibration error	0.07	0.15
Avg. response length (tokens)	142	214

The Problem

The product team asks: Which model is better, and why? The challenge is that the two models have the same F1 score (0.77) but very different error profiles. Model A is more conservative and more precise (precision 0.84, recall 0.71); Model B is more likely to provide a usable answer (recall 0.88) but also hallucinates more often (precision 0.69, major factual error rate 0.19 versus 0.11).
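
To see why F1 hides the difference, it can help to recompute it from the reported precision and recall and compare it with a precision-weighted variant. The sketch below is only an illustration: the function name `f_beta` and the choice of beta = 0.5 (which weights precision more heavily than recall) are assumptions, not part of the original evaluation.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta=1 gives F1."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the Current Performance table.
models = {"Model A": (0.84, 0.71), "Model B": (0.69, 0.88)}

for name, (p, r) in models.items():
    f1 = round(f_beta(p, r), 2)
    # beta=0.5 is an assumed weighting that favors precision over recall.
    f_half = round(f_beta(p, r, beta=0.5), 2)
    print(name, "F1:", f1, "F0.5:", f_half)
```

Both models come out at F1 = 0.77, matching the table, while the precision-weighted score separates them (about 0.81 for Model A versus 0.72 for Model B). Whether precision deserves the extra weight is a product decision, not something the metric settles on its own.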

Requirements

Decide which model you would prefer for a physics tutoring product and justify the choice.
Interpret the tradeoff between precision and recall in this setting.
Use the confusion matrix implied by the reported metrics, together with the error patterns, to explain likely user impact.
Recommend 3-4 concrete next steps to improve evaluation and model quality.
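
No confusion matrix is given directly, but approximate counts can be backed out of the reported accuracy, precision, and recall. The sketch below assumes the three metrics describe a single binary labeling over all 500 prompts; the source does not define the positive class, so treat the numbers as rough estimates. The helper name `implied_confusion` is hypothetical.

```python
def implied_confusion(n: int, accuracy: float, precision: float, recall: float) -> dict:
    """Estimate TP/FP/FN/TN counts from aggregate metrics (assumed binary setup)."""
    errors = n * (1 - accuracy)        # FP + FN
    fp_per_tp = 1 / precision - 1      # FP / TP, from precision = TP / (TP + FP)
    fn_per_tp = 1 / recall - 1         # FN / TP, from recall = TP / (TP + FN)
    tp = errors / (fp_per_tp + fn_per_tp)
    fp = tp * fp_per_tp
    fn = tp * fn_per_tp
    tn = n - tp - fp - fn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Metrics as reported for the 500-prompt held-out set.
model_a = implied_confusion(500, accuracy=0.78, precision=0.84, recall=0.71)
model_b = implied_confusion(500, accuracy=0.74, precision=0.69, recall=0.88)

for name, counts in (("Model A", model_a), ("Model B", model_b)):
    print(name, {k: round(v) for k, v in counts.items()})
```

Under these assumptions, Model A has roughly 35 false positives and 75 false negatives, while Model B has roughly 100 false positives and 30 false negatives. The counts are not integers before rounding because the published metrics are rounded to two decimals.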

Constraints

Incorrect physics explanations can reduce student trust.
The product cannot rely on human review at inference time.
Latency and token cost matter, so longer answers are not automatically better.
