NovaLearn evaluates an LLM that solves physics word problems and shows its full mathematical derivation. Reviewers found that some final answers are numerically close to correct even though the reasoning chain misapplies a physical law. The task is to evaluate whether the model's derivation correctly applies the inverse-square law and to identify the exact step where it fails.
The team audited 1,200 derivation traces on radiation, gravity, and light-intensity problems. A derivation is marked correct only if both the final answer and the reasoning steps are valid.
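The correctness criterion above (final answer valid *and* every step valid) can be sketched as follows; the names `DerivationTrace` and `trace_is_correct` are hypothetical, not from the audit tooling:

```python
from dataclasses import dataclass

@dataclass
class DerivationTrace:
    steps_valid: list  # per-step validity judgments (booleans)
    answer_correct: bool  # final numeric answer within tolerance

def trace_is_correct(trace: DerivationTrace) -> bool:
    # A trace counts as correct only if the final answer AND all steps hold.
    return trace.answer_correct and all(trace.steps_valid)

# A correct final answer with one invalid step is still marked incorrect.
print(trace_is_correct(DerivationTrace([True, False, True], True)))  # False
```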
| Metric | Value |
|---|---|
| Step-level accuracy | 0.91 |
| Final-answer accuracy | 0.84 |
| Precision on "derivation error" flag | 0.88 |
| Recall on "derivation error" flag | 0.69 |
| F1 score | 0.77 |
| Cases involving inverse-square law | 320 |
| Inverse-square-law cases with reasoning error | 96 |
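The reported F1 is consistent with the precision and recall in the table, assuming it is the standard harmonic mean on the "derivation error" flag:

```python
# Sanity check: F1 as the harmonic mean of the reported precision and recall.
precision, recall = 0.88, 0.69
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.77, matching the table
```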
In inverse-square-law questions, the model often produces a plausible final number while making a subtle reasoning mistake, such as scaling by $1/r$ instead of $1/r^2$, or inverting the ratio between the two distances. The evaluation challenge is to diagnose where the derivation becomes invalid, not just whether the final answer is wrong.
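A step checker for this failure mode can compare the model's claimed intensity ratio against the inverse-square prediction $I_2/I_1 = (r_1/r_2)^2$ and against the two error patterns named above. This is a minimal sketch; `diagnose_ratio` and its labels are hypothetical:

```python
def diagnose_ratio(r1: float, r2: float, claimed: float, tol: float = 1e-6) -> str:
    """Classify a claimed intensity ratio I2/I1 for distances r1 -> r2."""
    correct = (r1 / r2) ** 2  # inverse-square law: I2/I1 = (r1/r2)^2
    if abs(claimed - correct) < tol:
        return "correct"
    if abs(claimed - r1 / r2) < tol:  # scaled by 1/r instead of 1/r^2
        return "linear-scaling error"
    if abs(claimed - (r2 / r1) ** 2) < tol:  # inverted the distance ratio
        return "inverted-ratio error"
    return "other error"

# Doubling the distance should quarter the intensity:
print(diagnose_ratio(1.0, 2.0, 0.25))  # correct
print(diagnose_ratio(1.0, 2.0, 0.5))   # linear-scaling error
```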