Your RAG system is retrieving the right documents, but the model still produces confident answers that are not supported by those documents. You need a plan to reduce hallucinations without making the system uselessly conservative.
What would you change across prompting, generation, verification, and evaluation to make the answers more faithful to the retrieved context?
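For the verification part of that question, one starting point is a cheap post-generation check that flags answer sentences with little lexical support in the retrieved passages. The sketch below is only an illustration under loud assumptions: the function names (`support_score`, `unsupported_sentences`), the stopword list, and the `0.6` threshold are all hypothetical, and token overlap is a crude stand-in for a real entailment or citation check. It can show where an answer drifts from its context, but it will miss paraphrases and cannot catch a sentence that reuses the right words to say the wrong thing.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "was", "are", "were", "it",
             "in", "by", "of", "and", "to", "that", "this"}


def content_tokens(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def support_score(sentence, passages):
    """Fraction of the sentence's content tokens found in the best-matching passage."""
    tokens = content_tokens(sentence)
    if not tokens:
        # Nothing checkable in the sentence, so treat it as trivially supported.
        return 1.0
    return max(
        (len(tokens & content_tokens(p)) / len(tokens) for p in passages),
        default=0.0,  # no passages retrieved: nothing can be supported
    )


def unsupported_sentences(answer, passages, threshold=0.6):
    """Return answer sentences whose support score falls below the (assumed) threshold."""
    return [s for s in split_sentences(answer) if support_score(s, passages) < threshold]
```

Flagged sentences could then be dropped, regenerated with a stricter prompt, or replaced with an explicit "not stated in the sources" message. Tuning the threshold is where the trade-off against over-conservative answers shows up: set it too high and correct paraphrases get rejected.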