Design Trustworthy AI Workflow Copilot

Hard

Generative AI & LLMs

Context

RelayOps is building an AI-assisted workflow copilot for operations teams. The feature drafts actions such as summarizing tickets, proposing next steps, filling forms, and preparing tool actions, but users must feel in control and trust that the system will not take unsafe or unsupported actions.

Constraints

p95 end-to-end latency: 2,500 ms for read-only tasks; 4,000 ms for tasks involving tool planning
Cost ceiling: $35K/month at 1.2M requests/month
Hallucination ceiling: <2% materially incorrect claims on a 400-task golden set
Unauthorized action rate: 0%; any external action must require explicit user confirmation
Prompt-injection success rate from tool outputs or retrieved content: <0.5%
Must support audit logs, editable drafts, and clear provenance for recommendations
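
To make the numeric constraints concrete, the short sketch below turns them into a per-request cost budget and pass/fail thresholds. It is only an illustration: it assumes the hallucination ceiling is scored as one outcome per golden-set task, and the 1,000-case injection test set is a hypothetical size, not something the problem statement specifies.

```python
# Figures taken from the constraints above.
MONTHLY_BUDGET_USD = 35_000
MONTHLY_REQUESTS = 1_200_000
GOLDEN_SET_SIZE = 400

# Hypothetical size of a prompt-injection test suite (an assumption).
INJECTION_SET_SIZE = 1_000


def max_failures(n_cases: int, ceiling_num: int, ceiling_den: int) -> int:
    """Largest failure count k such that k / n_cases is strictly below
    ceiling_num / ceiling_den. Integer arithmetic avoids float rounding."""
    return (ceiling_num * n_cases - 1) // ceiling_den


# Average spend allowed per request: about 2.9 cents.
budget_per_request = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS

# "<2%" on 400 tasks allows at most 7 materially incorrect tasks.
max_incorrect_tasks = max_failures(GOLDEN_SET_SIZE, 2, 100)

# "<0.5%" on the assumed 1,000 attack cases allows at most 4 successes.
max_injection_successes = max_failures(INJECTION_SET_SIZE, 5, 1000)

print(f"budget per request: ${budget_per_request:.4f}")
print(f"max incorrect golden-set tasks: {max_incorrect_tasks}")
print(f"max injection successes: {max_injection_successes}")
```

The per-request figure is an average across all traffic, so a design can spend more on tool-planning requests as long as cheaper read-only requests offset it.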

Available Resources

Historical workflow logs: 2M prior tickets, notes, resolutions, and user edits
Internal knowledge base: 120K documents (policies, SOPs, runbooks)
Tools: ticketing API, CRM API, calendar API, internal search, and approval service
Approved models: one high-quality frontier model and one cheaper fast model
25 domain experts available to label evaluation sets and review failures

Task

Design the end-to-end architecture for an AI workflow system that preserves user trust and control, including when to use RAG, when to call tools, and where human confirmation is required.
Write a system prompt that enforces grounded recommendations, explicit uncertainty, refusal behavior, and a strict “draft-first / confirm-before-act” policy.
Define an evaluation plan before implementation, including offline and online metrics for trust, hallucination, unauthorized actions, and prompt-injection resilience.
Estimate cost and latency for the proposed design, and explain how you would tier models or cache work to stay within budget.
Identify the main failure modes and mitigations, especially around hallucinated actions, stale knowledge, prompt injection, and over-automation.
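
As a starting point for the "draft-first / confirm-before-act" requirement, the sketch below shows one possible shape for a confirmation gate in front of external tools. All names here (`DraftAction`, `ActionGate`, `ConfirmationRequired`) are hypothetical and invented for illustration. It is a minimal sketch of the idea that the model only ever produces drafts, and a separate code path executes them after a user confirms. It is not a complete answer to the task.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


class ConfirmationRequired(Exception):
    """Raised when an external action is attempted without user confirmation."""


@dataclass
class DraftAction:
    tool: str                           # e.g. "ticketing", "crm", "calendar"
    payload: dict                       # editable by the user before confirming
    sources: list                       # provenance: IDs of supporting documents
    confirmed_by: Optional[str] = None  # user ID, set only on explicit confirm

    def confirm(self, user_id: str) -> None:
        self.confirmed_by = user_id


@dataclass
class ActionGate:
    audit_log: list = field(default_factory=list)

    def execute(self, draft: DraftAction, run: Callable[[dict], Any]) -> Any:
        # Enforce the policy in code, not only in the system prompt.
        if draft.confirmed_by is None:
            self.audit_log.append({"event": "blocked", "tool": draft.tool})
            raise ConfirmationRequired(f"{draft.tool} action needs confirmation")
        result = run(draft.payload)
        self.audit_log.append({
            "event": "executed",
            "tool": draft.tool,
            "confirmed_by": draft.confirmed_by,
            "sources": draft.sources,
        })
        return result


gate = ActionGate()
draft = DraftAction("ticketing", {"ticket_id": "T-1", "status": "resolved"}, ["kb-42"])

try:
    gate.execute(draft, lambda p: "sent")   # unconfirmed: blocked and logged
except ConfirmationRequired:
    pass

draft.confirm("user-7")
outcome = gate.execute(draft, lambda p: "sent")  # confirmed: runs and is logged
```

Placing the check in the execution path means a prompt-injected or hallucinated action can at worst become a draft the user sees, which is one way to approach the 0% unauthorized action constraint.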
