Context
FinFlow, a B2B accounts-payable platform, wants an LLM-powered support agent embedded directly into customer workflows. The agent should help finance ops users answer invoice questions, draft dispute emails, and trigger safe actions like updating payment status or opening a ticket.
Constraints
- p95 end-to-end latency: at most 3,000 ms for read-only requests; at most 5,000 ms for action-taking requests
- Cost ceiling: $35K/month at 120K requests/day
- Hallucination ceiling: <2% on a labeled workflow test set
- Prompt-injection success rate: <0.5% on adversarial tests
- Any action that changes customer data must be auditable and require explicit confirmation
- The system must not expose PII or data from another customer tenant
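As a rough sanity check on the cost ceiling, the monthly budget can be converted into a per-request budget. The sketch below assumes a 30-day month and a flat request rate, neither of which the brief specifies; the function name is illustrative only.

```python
# Per-request budget implied by the cost ceiling.
# Assumptions (not in the brief): 30-day month, constant daily volume.
REQUESTS_PER_DAY = 120_000
MONTHLY_COST_CEILING_USD = 35_000
DAYS_PER_MONTH = 30  # assumption


def per_request_budget_usd(requests_per_day: int,
                           monthly_ceiling_usd: float,
                           days_per_month: int = DAYS_PER_MONTH) -> float:
    """Return the average spend allowed per request, in USD."""
    monthly_requests = requests_per_day * days_per_month
    return monthly_ceiling_usd / monthly_requests


budget = per_request_budget_usd(REQUESTS_PER_DAY, MONTHLY_COST_CEILING_USD)
print(f"{budget:.4f}")  # 3.6M requests/month, so roughly $0.0097 per request
```

Under these assumptions the whole pipeline (all model calls plus tool calls for one request) has to average just under one cent.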
Available Resources
- 2 years of support tickets, help-center articles, API docs, and workflow logs
- Tools: search_kb, get_invoice, get_vendor, create_ticket, update_payment_status, draft_email
- Approved models: GPT-4.1-mini for orchestration, GPT-4.1 for hard cases
- Existing RBAC service, audit log service, and feedback widget
- 20 customer-success managers available to label a golden set of real tasks
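One way to read the confirmation and tenant-isolation constraints against the tool list is as a gate that runs before any tool call executes. The following is a minimal sketch, not a prescribed design. The split of tools into read-only and action-taking sets is an assumption (the brief does not classify them), and ToolCall, gate, and the returned status strings are hypothetical names.

```python
from dataclasses import dataclass

# Assumed classification; the brief lists the tools but does not label them.
READ_ONLY_TOOLS = {"search_kb", "get_invoice", "get_vendor", "draft_email"}
ACTION_TOOLS = {"update_payment_status", "create_ticket"}


@dataclass
class ToolCall:
    tool: str
    tenant_id: str           # tenant the call would read from or write to
    confirmed: bool = False  # set only after explicit user confirmation


def gate(call: ToolCall, session_tenant_id: str) -> str:
    """Decide whether a proposed tool call may run (hypothetical policy)."""
    if call.tool not in READ_ONLY_TOOLS | ACTION_TOOLS:
        return "reject_unknown_tool"
    if call.tenant_id != session_tenant_id:
        return "reject_cross_tenant"
    if call.tool in ACTION_TOOLS and not call.confirmed:
        return "needs_confirmation"
    return "allow"


print(gate(ToolCall("update_payment_status", "tenant-a"), "tenant-a"))
```

In a fuller design the "allow" branch for action tools would also write to the audit log service, and the tenant check would defer to the existing RBAC service rather than a string comparison.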
Task
- Design an agentic workflow that can answer questions and safely take actions inside customer workflows, including tool selection, confirmation steps, and termination criteria.
- Define an evaluation plan before committing to an architecture: offline golden-set evaluation, hallucination measurement, prompt-injection testing, and online metrics after launch.
- Write a system prompt that enforces tenant isolation, grounded reasoning, safe refusal behavior, and structured outputs for both read-only and action-taking flows.
- Estimate cost and latency at target volume, and explain when you would route to the larger model versus the cheaper model.
- Identify the top failure modes you expect in production and how you would detect and mitigate them.
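For the routing question, a simple blended-cost model bounds how often requests can be escalated to the larger model. If a fraction f of requests goes to the larger model, the average cost is (1 - f) * small_cost + f * large_cost, which stays within a per-request budget when f <= (budget - small_cost) / (large_cost - small_cost). The sketch below assumes each request is served entirely by one model; the function name and the example costs are placeholders, not actual prices for GPT-4.1-mini or GPT-4.1.

```python
def max_escalation_fraction(budget_per_request: float,
                            small_cost: float,
                            large_cost: float) -> float:
    """Largest share of requests that can use the larger model within budget.

    All arguments are average USD cost per request. Costs are inputs to be
    measured from real token counts; nothing here encodes actual pricing.
    """
    if large_cost <= small_cost:
        raise ValueError("large_cost must exceed small_cost")
    if small_cost >= budget_per_request:
        # The smaller model alone already meets or exceeds the budget.
        return 0.0
    fraction = (budget_per_request - small_cost) / (large_cost - small_cost)
    return min(1.0, fraction)


# Placeholder numbers for illustration only.
share = max_escalation_fraction(budget_per_request=0.01,
                                small_cost=0.004,
                                large_cost=0.02)
print(f"{share:.3f}")
```

The same calculation can be rerun once measured per-request costs are available, which turns the routing policy into a budgeted escalation rate rather than a guess.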