Context
FinFlow has an LLM-powered support copilot that helps operations analysts resolve failed payment API calls. The copilot reads API docs, inspects recent request/response logs, and suggests the likely root cause plus the next troubleshooting step.
Constraints
- p95 latency: at most 2,500 ms end-to-end
- Cost ceiling: $12K/month at 300K troubleshooting sessions
- Accuracy bar: top suggested root cause must be correct in at least 85% of labeled incidents
- Hallucination ceiling: unsupported claims in fewer than 4% of responses
- Safety: must not leak secrets from logs, must resist prompt injection in API payloads or docs, and must refuse when evidence is insufficient
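The cost and latency ceilings above imply a hard per-session budget worth checking up front. A back-of-envelope sketch (the token counts and per-token prices below are illustrative assumptions, not published rates):

```python
# Derive the per-session budget implied by the constraints above.
MONTHLY_BUDGET_USD = 12_000
SESSIONS_PER_MONTH = 300_000

per_session_budget = MONTHLY_BUDGET_USD / SESSIONS_PER_MONTH
print(f"per-session budget: ${per_session_budget:.3f}")  # $0.040

# Hypothetical blended price and token footprint, for sizing only.
ASSUMED_USD_PER_1K_TOKENS = 0.002   # assumption, not a real price sheet
tokens_affordable = per_session_budget / ASSUMED_USD_PER_1K_TOKENS * 1_000
print(f"tokens affordable per session: {tokens_affordable:,.0f}")
```

Under these assumed prices a session can spend roughly 20K tokens across all calls, which is why routing most traffic to the cheaper orchestration model matters.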
Available Resources
- 18 months of historical API incident tickets with final root-cause labels
- API gateway logs containing request metadata, status codes, sanitized payload snippets, and upstream error messages
- Internal API documentation, runbooks, changelogs, and vendor integration guides
- Approved models: GPT-4.1-mini for orchestration and GPT-4.1 for hard cases
- Tools available to the agent: search_docs, search_incidents, get_recent_logs, get_api_schema
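It helps to pin down tool signatures before designing the calling flow. A minimal sketch of what the four tools might look like; the parameter names, return shapes, and defaults below are assumptions, not the real FinFlow interfaces:

```python
from typing import TypedDict

class DocHit(TypedDict):
    title: str
    snippet: str
    source: str   # e.g. "runbook", "changelog", "vendor_guide" (assumed taxonomy)

def search_docs(query: str, top_k: int = 5) -> list[DocHit]:
    """Retrieve doc/runbook/changelog passages relevant to the query."""
    raise NotImplementedError

def search_incidents(error_signature: str, top_k: int = 3) -> list[dict]:
    """Find historical incidents with similar signatures and their root-cause labels."""
    raise NotImplementedError

def get_recent_logs(endpoint: str, window_minutes: int = 60) -> list[dict]:
    """Fetch sanitized gateway log entries (status codes, upstream errors) for an endpoint."""
    raise NotImplementedError

def get_api_schema(endpoint: str) -> dict:
    """Return the current request/response schema for an endpoint."""
    raise NotImplementedError
```

Fixing signatures like these early also makes the evaluation plan concrete: each tool can be tested in isolation against the labeled incidents before the end-to-end agent is debugged.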
Task
You are asked to design and troubleshoot this integration assistant. Users report that it gives plausible but wrong diagnoses for complex failures such as auth drift, schema mismatches, idempotency conflicts, and vendor-side outages.
- Propose an evaluation-first plan to diagnose where the system is failing: retrieval, tool use, prompt design, reasoning, or stale documentation.
- Design the agent architecture, including tool-calling flow, termination criteria, and fallback behavior when evidence is incomplete or conflicting.
- Write a system prompt that forces grounded troubleshooting, safe refusal, and structured output with root cause, evidence, confidence, and next action.
- Estimate cost and latency for a typical troubleshooting session, and explain how you would stay within budget without materially increasing hallucinations.
- Identify the main failure modes for this API integration use case, including prompt injection via logs/docs and leakage of secrets or PII.
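The architecture, termination, and structured-output requirements above can be sketched as a minimal control loop. Everything here is a hypothetical shape, not a prescribed solution: the field names, thresholds, and the escalation policy are assumptions a candidate would justify.

```python
from dataclasses import dataclass

@dataclass
class Diagnosis:
    root_cause: str
    evidence: list[str]   # verbatim quotes from logs/docs, never paraphrased secrets
    confidence: float     # 0.0-1.0, model self-reported, calibrated offline
    next_action: str

MAX_TOOL_CALLS = 6        # termination criterion: bounded tool budget per session
ESCALATE_BELOW = 0.6      # fallback threshold: assumed value, tuned on labeled incidents

def finish(d: Diagnosis) -> dict:
    """Terminate the session: emit structured output, or refuse when evidence is thin."""
    if d.confidence < ESCALATE_BELOW or not d.evidence:
        # Safe refusal path: no unsupported root cause is ever emitted.
        return {"status": "insufficient_evidence",
                "next_action": "escalate to a human analyst or the stronger model"}
    return {"status": "diagnosed",
            "root_cause": d.root_cause,
            "evidence": d.evidence,
            "confidence": d.confidence,
            "next_action": d.next_action}
```

A loop like this makes the constraints auditable: the tool budget bounds latency and cost, and the refusal branch is where the hallucination ceiling is enforced.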