You are evaluating an LLM application that uses retrieval before generation, and the team wants a clean way to measure whether poor user outcomes come from bad retrieval, weak answer generation, or unsupported claims. You need an evaluation framework that separates these failure modes clearly enough to guide iteration.
What metrics would you use to measure retrieval quality, answer quality, and hallucination in an LLM application?