At AcmeAI, account managers use a generative assistant to draft client-facing answers about product capabilities, pricing, and compliance. Leadership wants a lightweight NLP system that flags responses with high hallucination risk before they are shared with non-technical stakeholders.
You have 180,000 historical prompt-response pairs labeled by reviewers as Low Risk, Needs Review, or High Risk for hallucination. Text is English-only. Prompts range from 10 to 120 words, and model responses from 30 to 600 words (median 145). Labels are moderately imbalanced: 62% Low Risk, 28% Needs Review, 10% High Risk. Metadata includes model version, a retrieval-used flag, and product domain (sales, legal, support), but the core model should rely on the text alone.
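Given the 62/28/10 skew, training will likely need class weighting or resampling so the High Risk class is not swamped. A minimal sketch of inverse-frequency class weights (the helper name and the scaled-down label list are illustrative, not part of the dataset):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to inverse class frequency,
    normalized so a perfectly balanced dataset yields 1.0 per class."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {cls: n_total / (n_classes * n) for cls, n in counts.items()}

# Illustrative labels mirroring the stated 62/28/10 split (scaled down).
labels = ["Low Risk"] * 62 + ["Needs Review"] * 28 + ["High Risk"] * 10
weights = inverse_frequency_weights(labels)
# High Risk receives the largest weight (~3.3x), Low Risk the smallest.
```

These weights can be passed to most classifiers' loss functions; any equivalent scheme (oversampling, focal loss) would serve the same purpose.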
A good solution achieves High Risk recall >= 0.90, macro-F1 >= 0.82, and supports analyst review with interpretable signals. Inference should stay under 150 ms per response in batch scoring.
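The acceptance criteria can be checked directly from predictions. A small sketch of the evaluation gate, computing per-class recall and macro-F1 from label lists (the function name and the toy labels are illustrative):

```python
def per_class_recall_and_macro_f1(y_true, y_pred, classes):
    """Return (recall per class, macro-averaged F1) from parallel label lists."""
    recall, f1_scores = {}, []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        r = tp / (tp + fn) if tp + fn else 0.0
        prec = tp / (tp + fp) if tp + fp else 0.0
        recall[c] = r
        f1_scores.append(2 * prec * r / (prec + r) if prec + r else 0.0)
    return recall, sum(f1_scores) / len(f1_scores)

classes = ["Low Risk", "Needs Review", "High Risk"]
# Toy predictions for illustration only.
y_true = ["Low Risk", "Low Risk", "Needs Review", "High Risk", "High Risk"]
y_pred = ["Low Risk", "Needs Review", "Needs Review", "High Risk", "High Risk"]
recall, macro_f1 = per_class_recall_and_macro_f1(y_true, y_pred, classes)
passes_gate = recall["High Risk"] >= 0.90 and macro_f1 >= 0.82
```

In practice the same numbers would come from `sklearn.metrics.recall_score` and `f1_score` with `average="macro"`; the explicit version above shows exactly what the gate measures.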