You are improving an internal writing assistant that drafts summaries, rewrites text, and answers questions over uploaded documents for several hundred daily users. The current team reports that top-line accuracy looks acceptable on a small benchmark, but users still complain about unhelpful tone, unsupported claims, and inconsistent refusal behavior. The product is moving toward broader rollout, so you need an evaluation framework that measures quality beyond simple correctness and can catch regressions before launch.
How would you evaluate this language model beyond accuracy, and how would that evaluation plan drive your prompt, model, and system design choices under the latency, cost, and safety constraints above?