Context
FinPilot runs an LLM-powered support copilot for 4,000 agents handling billing, disputes, and policy questions. Usage has grown to 2.5M requests/day across chat summarization, grounded Q&A, and a small tool-using workflow; leadership wants tighter cost controls without losing answer quality or observability.
Constraints
- p95 latency: ≤1,800 ms for Q&A, ≤2,500 ms for tool-using flows
- Monthly inference budget: $180K at current volume
- Hallucination rate for grounded answers: <2% on a 1,000-query golden set
- Prompt-injection success rate: <0.5% on adversarial tests
- Must support per-tenant cost attribution, rate limiting, and incident debugging
- PII must be redacted from logs; prompts and outputs must remain auditable
Available Resources
- Historical request logs with prompt, model, token counts, latency, user feedback, and escalation outcome
- 120K internal help-center articles and policy docs, already permissioned by tenant
- Approved models: GPT-4.1-mini, GPT-4.1, and a cheaper summarization model
- Existing OpenTelemetry pipeline, Redis cache, and vector store with BM25 + dense retrieval
- 20 support leads available to label the golden set and review failures
Task
- Design an observability and cost-control strategy for large-scale LLM inference, including what to log, aggregate, alert on, and expose in dashboards (an illustrative per-request log record is sketched after this list).
- Propose an eval-first rollout plan: offline evaluation against the golden and adversarial sets first, then online monitoring and experimentation for routing, caching, and fallback policies (see the offline-gate sketch below).
- Define an inference architecture that balances model routing, retrieval, caching, and tool use under the latency and budget constraints (see the inference-path sketch below).
- Specify safeguards for hallucination, prompt injection, PII leakage, and runaway agent/tool loops; the inference-path sketch shows one way to bound tool loops.
- Estimate token, latency, and monthly cost impacts for your design, and explain the main tradeoffs (the back-of-envelope per-request budget is worked out in the last sketch below).
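Illustrative Sketches (non-normative)
The sketches below use Python and invented names to make the task bullets concrete; they are not a prescribed design. First, one possible per-request log record for the observability and cost-attribution work: the fields are drawn from the signals listed under Available Resources (model, token counts, latency, user feedback, escalation outcome) plus the per-tenant attribution and PII-redaction constraints, and every field name is an assumption.
```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class RequestLogRecord:
    """Illustrative per-request record; field names are assumptions, not a fixed schema."""
    request_id: str            # correlates with an OpenTelemetry trace for incident debugging
    tenant_id: str             # per-tenant cost attribution and rate limiting
    route: str                 # "summarization" | "grounded_qa" | "tool_flow"
    model: str                 # e.g. "gpt-4.1-mini" vs. "gpt-4.1"
    prompt_redacted: str       # PII stripped before logging; the raw prompt stays auditable elsewhere
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float            # token counts multiplied by the model's per-token price
    latency_ms: int            # compared against the 1,800 ms / 2,500 ms p95 ceilings
    cache_hit: bool
    retrieval_doc_ids: list    # which help-center articles grounded the answer
    user_feedback: Optional[str] = None
    escalated: bool = False

def emit(record: RequestLogRecord) -> None:
    # Stand-in for the existing OpenTelemetry exporter: one JSON line per request.
    print(json.dumps(asdict(record)))
```
Aggregating cost_usd and latency_ms by tenant_id, model, and route gives the dashboard and alerting dimensions the first task bullet asks for.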
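Next, a sketch of the offline gate an eval-first rollout might run before any routing, caching, or fallback change ships. The function names and dataset shapes are assumptions; the thresholds come directly from the constraints (<2% hallucination on the 1,000-query golden set, <0.5% prompt-injection success).
```python
from typing import Callable

def offline_gate(
    golden_set: list,            # grounded-Q&A cases labeled by the support leads
    adversarial_set: list,       # prompt-injection test cases
    candidate: Callable,         # the routing/caching/model configuration under test
    is_hallucination: Callable,  # judge: (answer, case) -> bool
    injection_succeeded: Callable,
) -> bool:
    """Return True only if the candidate configuration clears both offline thresholds."""
    halluc_rate = sum(is_hallucination(candidate(c), c) for c in golden_set) / len(golden_set)
    injection_rate = sum(injection_succeeded(candidate(c), c) for c in adversarial_set) / len(adversarial_set)
    print(f"hallucination={halluc_rate:.2%}  prompt-injection={injection_rate:.2%}")
    return halluc_rate < 0.02 and injection_rate < 0.005
```
Only configurations that pass this gate would move on to online monitoring and experimentation.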
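For the inference architecture and the runaway-loop safeguard, a minimal sketch of one possible request path: an exact-match Redis cache check, a cheap-versus-strong routing heuristic over the approved models, and a tool loop bounded by iteration and token caps. The cache keying, routing rule, TTL, cap values, and the generate/call_model/call_tool hooks are all placeholder assumptions (it also assumes the redis-py client and a locally reachable Redis), chosen only to make the shape concrete.
```python
import hashlib
import redis  # assumes the redis-py client and that the existing Redis cache is reachable locally

CHEAP_MODEL, STRONG_MODEL = "gpt-4.1-mini", "gpt-4.1"
MAX_TOOL_STEPS = 4        # runaway-loop guard: hard iteration cap (placeholder value)
MAX_TOOL_TOKENS = 8_000   # runaway-loop guard: per-request token budget (placeholder value)

r = redis.Redis(host="localhost", port=6379)

def cache_key(tenant_id: str, prompt: str) -> str:
    # Exact-match keying per tenant; semantic caching would be a separate experiment.
    return f"llm:{tenant_id}:{hashlib.sha256(prompt.encode()).hexdigest()}"

def pick_model(route: str, retrieval_score: float) -> str:
    # Illustrative heuristic: keep summarization and well-grounded Q&A on the cheaper
    # model, escalate tool flows and weakly grounded queries to the stronger one.
    if route == "tool_flow" or retrieval_score < 0.5:
        return STRONG_MODEL
    return CHEAP_MODEL

def answer(tenant_id: str, prompt: str, route: str, retrieval_score: float, generate) -> str:
    # 'generate' is a hypothetical model-client hook: (model, prompt) -> completion text.
    key = cache_key(tenant_id, prompt)
    cached = r.get(key)
    if cached:
        return cached.decode()                 # cache hit: no model call, no inference cost
    model = pick_model(route, retrieval_score)
    completion = generate(model, prompt)
    r.setex(key, 3600, completion)             # 1-hour TTL, illustrative
    return completion

def run_tool_loop(call_model, call_tool, messages: list) -> str:
    # call_model / call_tool are hypothetical hooks into the model and tool runtime.
    tokens_used = 0
    for _ in range(MAX_TOOL_STEPS):
        step = call_model(messages)            # returns {"type", "text" or "tool_call", "tokens"}
        tokens_used += step["tokens"]
        if step["type"] == "final":
            return step["text"]
        if tokens_used > MAX_TOOL_TOKENS:
            break                              # token budget exhausted: stop paying, fail closed
        messages.append(call_tool(step["tool_call"]))
    return "Escalating to a human agent."      # bounded exit instead of an unbounded loop
```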
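Finally, the back-of-envelope per-request budget implied by the stated volume and budget, assuming a 30-day month; the example token counts are placeholders to show the shape of the estimate, not measured values.
```python
REQUESTS_PER_DAY = 2_500_000
MONTHLY_BUDGET_USD = 180_000

requests_per_month = REQUESTS_PER_DAY * 30                        # 75M requests/month
budget_per_request = MONTHLY_BUDGET_USD / requests_per_month
print(f"average budget per request: ${budget_per_request:.4f}")   # ≈ $0.0024

# If a typical grounded-Q&A call used, say, 1,500 prompt + 300 completion tokens
# (placeholder numbers), the blended price would have to stay at or below roughly:
print(f"max blended price per 1K tokens: ${budget_per_request / 1.8:.4f}")  # ≈ $0.0013
```
At an average of about a quarter of a cent per request, the cache-hit rate and the share of traffic served by GPT-4.1-mini rather than GPT-4.1 dominate the cost tradeoffs the last task bullet asks about.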