Context
Northstar Health wants an internal assistant for support agents answering questions about benefits, claims policy, and member communications. Leadership is debating whether to ship with careful prompting, a RAG system over internal documents, or fine-tuning for domain-specific behavior.
Constraints
- p95 latency must be under 2,500ms for live agent assist
- Budget ceiling: $35K/month at 1.2M requests/month
- Hallucination rate must stay below 2% on policy questions
- Prompt injection success rate must be below 0.5% on adversarial tests
- Answers that affect coverage or reimbursement must cite approved sources or refuse
- PHI must not be exposed to unauthorized users or external logs
Available Resources
- 120K internal documents: policy manuals, claims SOPs, plan summaries, and approved email templates
- 18 months of historical support chats with agent-written resolutions and QA scores
- Approved models: GPT-4.1-mini, GPT-4.1, and a fine-tunable smaller model for internal deployment
- Existing hybrid search stack: BM25 + vector search with metadata filters by plan, state, and effective date
- Security team can provide document-level ACLs and redaction services
- 2 product analysts and 15 QA specialists can label an evaluation set
Task
- Propose an evaluation-first framework to compare prompting-only, RAG, and fine-tuning for this use case. Define offline and online metrics, golden sets, and pass/fail thresholds before choosing an architecture.
- Recommend which approach you would launch first, and where the other two approaches still fit. Be explicit about trade-offs across latency, cost, maintainability, freshness, and safety.
- Design the target system for your recommended approach, including prompting strategy, retrieval or training data design, fallback/refusal behavior, and access control.
- Explain how you would test and mitigate hallucination, prompt injection, stale policy answers, and PHI leakage.
- Estimate request-level and monthly cost/latency, and describe what changes you would make if the system misses either the budget or latency target.