Evaluate an Internal Writing Assistant

Scenario

You are improving an internal writing assistant that drafts summaries, rewrites text, and answers questions over uploaded documents for several hundred daily users. The current team reports that top-line accuracy looks acceptable on a small benchmark, but users still complain about unhelpful tone, unsupported claims, and inconsistent refusal behavior. The product is moving toward broader rollout, so you need an evaluation framework that measures quality beyond simple correctness and can catch regressions before launch.

Constraints

p95 end-to-end latency: 1,500ms for standard prompts
Cost ceiling: $0.03 per request at 200K requests/month
Unsupported factual claims must stay below 2% on a held-out evaluation set
The system must resist prompt injection from user-provided documents and avoid leaking sensitive text
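The numeric constraints above can be written down as explicit release gates. The following is a minimal sketch, not a prescribed implementation: every name in it is hypothetical, it assumes the cost ceiling applies to the mean cost per request, and it uses a nearest-rank definition of p95. It also makes the implied monthly spend ceiling explicit (200K requests at $0.03 each).

```python
# Constraint values copied from the scenario; all names here are illustrative.
REQUESTS_PER_MONTH = 200_000
COST_CEILING_CENTS_PER_REQUEST = 3      # $0.03 per request
P95_LATENCY_BUDGET_MS = 1_500
MAX_UNSUPPORTED_CLAIM_RATE = 0.02       # must stay strictly below 2%


def monthly_budget_usd() -> int:
    """Implied monthly spend ceiling, computed in integer cents."""
    return REQUESTS_PER_MONTH * COST_CEILING_CENTS_PER_REQUEST // 100


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty latency sample."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def violations(latencies_ms: list[float],
               mean_cost_cents: float,
               unsupported_rate: float) -> list[str]:
    """Return the names of the constraints that a candidate system breaks."""
    broken = []
    if p95(latencies_ms) > P95_LATENCY_BUDGET_MS:
        broken.append("latency")
    # Assumption: the ceiling is checked against the mean cost per request.
    if mean_cost_cents > COST_CEILING_CENTS_PER_REQUEST:
        broken.append("cost")
    if unsupported_rate >= MAX_UNSUPPORTED_CLAIM_RATE:
        broken.append("unsupported_claims")
    return broken
```

A candidate prompt or model configuration would pass this gate only when `violations(...)` returns an empty list on the held-out evaluation set.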

Available Resources

Historical prompts, model outputs, and user feedback from the existing assistant
Access to a GPT-4-class model and a smaller low-cost model for grading or routing
A document corpus with known source passages for a subset of tasks
Capacity for ~500 human labels per month from internal reviewers
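The labeling capacity limits how tightly the 2% unsupported-claim threshold can be verified. The sketch below is one way to reason about that, under stated assumptions: it uses a Wilson score interval at roughly 95% confidence (z = 1.96), assumes one label per response, and assumes an entire month's ~500 labels go to this single metric. The function name and parameters are hypothetical.

```python
import math


def wilson_upper_bound(failures: int, n: int, z: float = 1.96) -> float:
    """Upper end of the Wilson score interval for a failure rate.

    failures: labeled responses containing an unsupported claim
    n:        total labeled responses
    z:        normal quantile (1.96 is roughly 95% two-sided confidence)
    """
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p_hat = failures / n
    z2 = z * z
    center = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (center + margin) / (1 + z2 / n)


def max_failures_allowed(n: int, threshold: float = 0.02) -> int:
    """Largest failure count whose upper bound stays below the threshold.

    Returns -1 if even zero failures cannot demonstrate the threshold.
    """
    allowed = -1
    for k in range(n + 1):
        if wilson_upper_bound(k, n) < threshold:
            allowed = k
        else:
            break
    return allowed
```

Calling `max_failures_allowed(500)` shows how few unsupported responses a 500-label sample can contain before the upper bound crosses 2%, which is one argument for supplementing human labels with model-based grading calibrated against them.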

Question

How would you evaluate this language model beyond accuracy, and how would that evaluation plan drive your prompt, model, and system design choices under the latency, cost, and safety constraints above?
