Ship a Customer Workflow Agent

Context

FinFlow, a B2B accounts-payable platform, wants an LLM-powered support agent embedded directly into customer workflows. The agent should help finance ops users answer invoice questions, draft dispute emails, and trigger safe actions like updating payment status or opening a ticket.

Constraints

p95 end-to-end latency: 3,000ms for read-only requests; 5,000ms for action-taking requests
Cost ceiling: $35K/month at 120K requests/day
Hallucination ceiling: <2% on a labeled workflow test set
Prompt-injection success rate: <0.5% on adversarial tests
Any action that changes customer data must be auditable and require explicit confirmation
The system must not expose PII or data from another customer tenant
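
The cost ceiling and request volume above imply a per-request budget, which is a useful anchor for the routing question later in the brief. The sketch below works that arithmetic out, assuming a 30-day month (the brief does not define one). The function names and the two per-request cost arguments are hypothetical; the example costs are placeholders, not real model pricing.

```python
# Figures taken from the constraints above.
REQUESTS_PER_DAY = 120_000
MONTHLY_CEILING_USD = 35_000
DAYS_PER_MONTH = 30  # assumption: the brief does not define "month"


def per_request_budget_usd() -> float:
    """Average spend allowed per request under the monthly ceiling."""
    return MONTHLY_CEILING_USD / (REQUESTS_PER_DAY * DAYS_PER_MONTH)


def max_large_model_share(small_cost: float, large_cost: float) -> float:
    """Largest fraction of requests that can be routed to the larger model.

    Solves (1 - f) * small_cost + f * large_cost <= budget for f and clamps
    the result to [0, 1]. Both costs are per-request figures the caller has
    to measure; neither is given in the brief.
    """
    budget = per_request_budget_usd()
    if large_cost <= small_cost:
        # Degenerate case: the "larger" model is not more expensive.
        return 1.0 if large_cost <= budget else 0.0
    share = (budget - small_cost) / (large_cost - small_cost)
    return max(0.0, min(1.0, share))


if __name__ == "__main__":
    print(f"budget per request: ${per_request_budget_usd():.4f}")
    # Placeholder costs, purely illustrative.
    print(f"max share on larger model: {max_large_model_share(0.004, 0.02):.2%}")
```

Under the 30-day assumption the budget comes to roughly $0.0097 per request, covering every model call, retry, and tool round trip made for that request.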

Available Resources

2 years of support tickets, help-center articles, API docs, and workflow logs
Tools: search_kb, get_invoice, get_vendor, create_ticket, update_payment_status, draft_email
Approved models: GPT-4.1-mini for orchestration, GPT-4.1 for hard cases
Existing RBAC service, audit log service, and feedback widget
20 customer-success managers available to label a golden set of real tasks
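
One way to connect the tool list to the constraints is a small policy table that marks each tool as read-only or mutating and gates calls on tenant match and explicit confirmation. This is a minimal sketch, not a prescribed design: the read-only versus mutating split is an assumption (the brief only names updating payment status and opening a ticket as actions), and `authorize_tool_call`, `ToolDecision`, and the tenant arguments are hypothetical names. A real implementation would defer to the existing RBAC and audit log services.

```python
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    READ_ONLY = "read_only"
    MUTATING = "mutating"


# Assumed classification, for illustration only.
TOOL_EFFECTS = {
    "search_kb": Effect.READ_ONLY,
    "get_invoice": Effect.READ_ONLY,
    "get_vendor": Effect.READ_ONLY,
    "draft_email": Effect.READ_ONLY,  # assumed: produces a draft, sends nothing
    "create_ticket": Effect.MUTATING,
    "update_payment_status": Effect.MUTATING,
}

# p95 latency targets from the Constraints section.
P95_BUDGET_MS = {Effect.READ_ONLY: 3_000, Effect.MUTATING: 5_000}


@dataclass(frozen=True)
class ToolDecision:
    allowed: bool
    reason: str
    latency_budget_ms: int = 0


def authorize_tool_call(
    tool: str,
    session_tenant: str,
    resource_tenant: str,
    user_confirmed: bool,
) -> ToolDecision:
    """Decide whether a proposed tool call may run."""
    effect = TOOL_EFFECTS.get(tool)
    if effect is None:
        return ToolDecision(False, "unknown tool")
    if session_tenant != resource_tenant:
        return ToolDecision(False, "cross-tenant access")
    if effect is Effect.MUTATING and not user_confirmed:
        return ToolDecision(False, "confirmation required")
    return ToolDecision(True, "ok", P95_BUDGET_MS[effect])
```

Keeping this check in code outside the model means a successful prompt injection can at most propose a call; it cannot skip the confirmation step or reach another tenant's data.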

Task

Design an agentic workflow that can answer questions and safely take actions inside customer workflows, including tool selection, confirmation steps, and termination criteria.
Define an evaluation plan before architecture: offline golden-set evaluation, hallucination measurement, prompt-injection testing, and online metrics after launch.
Write a system prompt that enforces tenant isolation, grounded reasoning, safe refusal behavior, and structured outputs for both read-only and action-taking flows.
Estimate cost and latency at target volume, and explain when you would route to the larger model versus the cheaper model.
Identify the top failure modes you expect in production and how you would detect and mitigate them.
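
The evaluation plan asked for above can be tied directly to the two quantitative ceilings in the constraints. The sketch below shows one possible release gate over offline results; `EvalCounts`, `rate`, and `passes_release_gate` are hypothetical names, and how a response is labeled as hallucinated or an injection as successful is left to the labeling process.

```python
from dataclasses import dataclass

# Ceilings from the Constraints section; both are strict ("<") bounds.
HALLUCINATION_CEILING = 0.02
INJECTION_CEILING = 0.005


@dataclass(frozen=True)
class EvalCounts:
    golden_total: int           # labeled workflow test cases
    golden_hallucinated: int    # cases labeled as hallucinated
    adversarial_total: int      # prompt-injection attempts
    adversarial_succeeded: int  # attempts that achieved their goal


def rate(failures: int, total: int) -> float:
    """Failure rate; rejects an empty test set instead of returning 0."""
    if total <= 0:
        raise ValueError("test set must not be empty")
    return failures / total


def passes_release_gate(counts: EvalCounts) -> bool:
    """True only if both offline metrics are strictly below their ceilings."""
    hallucination = rate(counts.golden_hallucinated, counts.golden_total)
    injection = rate(counts.adversarial_succeeded, counts.adversarial_total)
    return hallucination < HALLUCINATION_CEILING and injection < INJECTION_CEILING
```

Because the bounds are strict, test-set size matters: a suite of 200 adversarial prompts with a single success sits exactly at 0.5% and fails this check, so the adversarial set needs to be large enough for the threshold to be meaningful.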

