Context
Sparksoft is preparing to launch Sparksoft Support Copilot, an LLM-based assistant that answers customer support questions using help-center articles, product docs, and prior resolved tickets. The team wants a clear framework to decide whether the assistant is good enough for a limited launch.
Constraints
- p95 end-to-end latency must be < 2,500 ms in the Sparksoft support console
- Average inference + retrieval cost must stay < $0.035 per conversation turn
- Hallucination rate on factual support answers must be < 2% on a curated launch set
- Prompt injection success rate must be ~0% on adversarial tests
- Unsafe or policy-violating responses must be blocked or escalated
- The assistant must prefer refusal or escalation over confident guessing
Available Resources
- 40K Sparksoft help-center and product documentation articles
- 250K historical support conversations with resolution labels
- A 1,200-question candidate golden set drafted by support SMEs
- Access to approved LLMs (OpenAI GPT-4.1 / GPT-4.1-mini class) and embedding models
- Sparksoft search infrastructure with keyword and vector retrieval
- Human reviewers from QA and support operations for spot checks
Task
- Define a launch-readiness evaluation plan for Sparksoft Support Copilot, including offline and online metrics for answer quality, safety, latency, and cost.
- Specify the minimum thresholds you would require before launch, and explain which metrics are hard blockers versus monitor-only guardrails.
- Propose the prompting and system design choices you would use to reduce hallucinations and prompt-injection risk while preserving answer quality.
- Describe how you would segment evaluation (for example by issue type, customer tier, query ambiguity, or unsupported questions) so aggregate metrics do not hide critical failures.
- Outline a limited-launch plan with monitoring, rollback criteria, and how you would use user and agent feedback to decide whether to expand, pause, or retrain.