You are preparing to ship a RAG system and a general LLM application, and the team wants a clear evaluation plan for both offline testing and production monitoring. You need to show how you would assess answer quality, retrieval quality, hallucination risk, and whether a new version actually improves user outcomes after release.
How would you evaluate a RAG system and an LLM application before and after deployment?