Context
FinSure offers an internal support copilot that answers customer-service questions using policy manuals, billing procedures, and compliance FAQs. A large enterprise customer reports that answer quality is inconsistent, but it is unclear whether the root cause lies in the prompt design, the retrieval pipeline, or the underlying model.
Constraints
- p95 latency: ≤2,500 ms end-to-end
- Cost ceiling: $0.03 per request, i.e. $18K/month at 20K requests/day
- Hallucination rate: <2% on a labeled golden set
- Prompt injection success rate: <1% on adversarial tests
- The system must cite retrieved sources for factual claims and refuse when evidence is insufficient
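
The two cost figures are internally consistent: $0.03 × 20K requests/day × 30 days = $18K/month. A minimal sketch of how a diagnostic harness might encode these constraints as data, so every experiment cell can be gated against the same budget; `SLOBudget` and `meets_slo` are illustrative names, not part of FinSure's stack:

```python
# Sketch: the constraints above as data, so ablation results can be
# gated uniformly. All names here are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class SLOBudget:
    p95_latency_ms: float = 2_500.0
    cost_per_request_usd: float = 0.03
    monthly_cost_usd: float = 18_000.0
    requests_per_day: int = 20_000
    max_hallucination_rate: float = 0.02      # <2% on the golden set
    max_injection_success_rate: float = 0.01  # <1% on adversarial tests

def meets_slo(p95_ms: float, cost_usd: float, halluc_rate: float,
              inject_rate: float, b: SLOBudget = SLOBudget()) -> bool:
    """True iff a measured configuration satisfies all four constraints."""
    return (p95_ms <= b.p95_latency_ms
            and cost_usd <= b.cost_per_request_usd
            and halluc_rate < b.max_hallucination_rate
            and inject_rate < b.max_injection_success_rate)

# Sanity check: $0.03/request * 20K requests/day * 30 days = $18K/month.
budget = SLOBudget()
assert abs(budget.cost_per_request_usd * budget.requests_per_day * 30
           - budget.monthly_cost_usd) < 1e-6
```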
Available Resources
- 120K internal documents (PDFs, HTML help center pages, policy docs, ticket macros)
- Existing hybrid search stack (BM25 + dense vector search) and document metadata; a score-fusion sketch follows this list
- Three approved models: a small, fast model, a mid-tier model, and a premium model
- 400 historical customer questions with human-rated answers, plus 50 known-bad examples from the customer
- Access to prompt templates, retrieval logs, ranked results, and model outputs
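
The hybrid stack is the natural seam for retrieval-only ablations: run the same questions against BM25-only, dense-only, and fused rankings. A minimal sketch of one standard fusion method, reciprocal rank fusion (RRF); it assumes each backend returns document IDs ranked best-first, and `k=60` is the conventional RRF smoothing constant rather than anything FinSure-specific:

```python
# Sketch of reciprocal rank fusion (RRF) over the two retrieval backends.
# Assumes ranked lists of document IDs, best-first; names are illustrative.
from collections import defaultdict

def rrf_fuse(bm25_ids: list[str], dense_ids: list[str],
             k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse two rankings; each document scores sum(1 / (k + rank))."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

If answer quality tracks one backend but not the other, or the fused ranking alone, that localizes the failure to retrieval rather than the prompt or model.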
Task
- Propose a step-by-step diagnosis plan that isolates whether failures are caused by prompt design, retrieval quality, or model capability.
- Define an offline and online evaluation strategy, including how you would build a golden set, measure groundedness, and detect prompt injection or unsupported answers (a groundedness-check sketch follows this list).
- Recommend an architecture and experimentation plan to test prompt-only, retrieval-only, and model-only changes while respecting latency and cost constraints.
- Write a production-quality prompt and Python implementation for a diagnostic harness that runs ablations and returns structured root-cause signals (a skeleton of the ablation loop follows this list).
- List the most important failure modes, mitigations, and tradeoffs you would communicate to the customer and internal stakeholders.
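
For the evaluation task, a minimal sketch of a groundedness check that flags unsupported sentences. The `[doc_id]` citation format, the regex, and the sentence splitter are assumptions for illustration, not FinSure's real conventions:

```python
# Sketch: per-sentence groundedness against the retrieved set. Citation
# format "[doc_id]" and the sentence splitter are assumptions.
import re

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")  # e.g. "[policy_204]"

def _sentences(answer: str) -> list[str]:
    return [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]

def unsupported_sentences(answer: str, retrieved_ids: set[str]) -> list[str]:
    """Sentences whose citations are absent or point outside the retrieved set."""
    flagged = []
    for sentence in _sentences(answer):
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= retrieved_ids:
            flagged.append(sentence)
    return flagged

def groundedness_rate(answer: str, retrieved_ids: set[str]) -> float:
    """Share of sentences fully supported by retrieved citations (0.0-1.0)."""
    sents = _sentences(answer)
    if not sents:
        return 0.0
    return 1.0 - len(unsupported_sentences(answer, retrieved_ids)) / len(sents)
```

Injection detection can reuse the same plumbing: include adversarial questions whose retrieved chunks carry canary instructions, and count answers that act on them toward the <1% budget.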
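For the diagnostic-harness task, a minimal skeleton of the ablation loop: vary exactly one axis (prompt, retrieval, model) at a time against a fixed baseline, recording latency and cost per cell so the experiment plan stays inside the constraints above. `Config`, `CellResult`, and `evaluate` are illustrative names; the real harness would plug in FinSure's stack and golden set:

```python
# Sketch: one-variable-at-a-time ablations with structured per-axis
# signals. `evaluate` is assumed to run one configuration end-to-end
# over the golden set and return aggregate metrics.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Config:
    prompt: str       # e.g. "baseline_v3", "strict_citation_v1"
    retriever: str    # e.g. "bm25", "dense", "fused"
    model: str        # e.g. "small", "mid", "premium"

@dataclass(frozen=True)
class CellResult:
    config: Config
    accuracy: float            # vs. human-rated answers
    grounded_rate: float       # share of fully supported answers
    p95_latency_ms: float
    cost_per_request_usd: float

def run_ablations(
    baseline: Config,
    variants: dict[str, list[str]],           # axis -> alternative values
    evaluate: Callable[[Config], CellResult],
) -> dict[str, list[CellResult]]:
    """Hold everything at baseline except one axis; return per-axis results."""
    results: dict[str, list[CellResult]] = {}
    for axis, values in variants.items():
        results[axis] = [
            evaluate(Config(**{**vars(baseline), axis: value}))
            for value in values
        ]
    return results
```

A large quality delta on exactly one axis is the root-cause signal: movement only across retriever variants implicates retrieval, movement only across model tiers implicates model capability, and otherwise the prompt is the prime suspect. The per-cell latency and cost feed directly into the routing decision among the three approved models.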