Context
FinPilot is adding an LLM-powered operations copilot that helps support agents resolve billing disputes. The assistant reads CRM notes, policy docs, and transaction history, then recommends the next action or drafts a customer response.
Constraints
- p95 end-to-end latency: ≤ 2,500 ms
- Cost ceiling: $0.035 per request and $25K/month at 30K requests/day (note: at 30K requests/day the monthly cap is the tighter bound, so the effective per-request budget is below $0.035)
- Hallucination ceiling: <2% on policy-grounded recommendations
- Unsafe action rate: 0% for irreversible actions (refund approval, account closure)
- Must resist prompt injection in CRM notes and customer-provided text
- Must not leak PII or hidden system instructions
Available Resources
- 80K internal policy and workflow documents, updated daily
- CRM tickets with agent notes, customer messages, and structured account metadata
- Tools: policy search, transaction lookup, refund eligibility checker, ticket escalation API
- Approved models: a small fast model for classification/routing and a stronger model for final reasoning
- 2,000 historical tickets with human-reviewed resolutions; 150 tickets can be labeled deeply for a golden set
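The two-tier model setup above is usually wired as a cascade: the small model classifies and routes, and only ambiguous or high-stakes tickets reach the stronger model or a human. A minimal sketch of that routing logic (all names, thresholds, and intent labels here are hypothetical, not part of FinPilot's stack):

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    route: str   # "small_model", "large_model", or "human_escalation"
    reason: str

# Hypothetical intent labels; the irreversible set mirrors the constraint
# that refund approval and account closure must never be auto-executed.
IRREVERSIBLE_INTENTS = {"refund_approval", "account_closure"}

def route_ticket(intent: str, confidence: float) -> RoutingDecision:
    """Route a ticket given the small classifier's intent label and confidence."""
    if intent in IRREVERSIBLE_INTENTS:
        # 0% unsafe-action constraint: irreversible actions always get a human.
        return RoutingDecision("human_escalation", "irreversible action")
    if confidence < 0.7:  # assumed threshold; tune against the golden set
        return RoutingDecision("large_model", "low classifier confidence")
    return RoutingDecision("small_model", "high-confidence routine intent")
```

For example, `route_ticket("refund_approval", 0.99)` escalates to a human regardless of confidence, while a low-confidence billing question falls through to the stronger model.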
Task
This is not a request to merely list generic risks. Design how you would reason about failure modes when integrating this LLM into a real product workflow.
- Define the most important failure modes across retrieval, prompting, tool use, grounding, safety, and user interaction. Prioritize them by business impact and likelihood.
- Propose an eval-first plan: offline evals before launch and online monitoring after launch. Include how you would measure hallucinations, prompt injection success, unsafe tool calls, and silent quality regressions.
- Describe the architecture and control points you would add to prevent or contain failures, including when the system should refuse, ask for clarification, or escalate to a human.
- Estimate cost and latency for your design, including where you would use smaller vs larger models.
- Explain rollout strategy, guardrails, and rollback criteria if failure rates exceed thresholds.