Context
FinFlow uses an LLM-powered operations workflow to process inbound vendor emails, extract structured fields, retrieve policy documents, and draft a recommended action for an analyst to approve. The workflow usually "works," but failures are often subtle: wrong field extraction, stale retrieval, unsafe tool use, overconfident recommendations, or silent degradation after prompt/model changes.
Constraints
- p95 end-to-end latency: ≤ 4,000 ms per workflow run
- Cost ceiling: $12K/month at 300K runs/month (the implied per-run budget is worked out after this list)
- Hallucination / unsupported recommendation rate: <2% on a labeled audit set
- Prompt-injection success rate from email or retrieved docs: <0.5%
- Human-review queue cannot increase by more than 10%
- All monitoring must avoid storing raw PII in logs
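For orientation, the cost ceiling above reduces to a per-run budget that any monitoring design has to fit inside. The sketch below works that arithmetic out from the stated numbers ($12K / 300K runs ≈ $0.04 per run) and shows how an asynchronous-scoring sample rate scales added cost; the judge cost per scored run is a placeholder assumption, not a quoted model price.

```python
# Back-of-envelope budget implied by the stated constraints.
MONTHLY_BUDGET_USD = 12_000
RUNS_PER_MONTH = 300_000

per_run_ceiling = MONTHLY_BUDGET_USD / RUNS_PER_MONTH  # ≈ $0.04 per run, all-in

def monitoring_cost(sample_rate: float, judge_cost_per_run: float) -> float:
    """Added monthly cost of asynchronously scoring a sampled fraction of runs.

    `judge_cost_per_run` is a placeholder for whatever an LLM-judge pass
    actually costs; plug in measured numbers rather than assumed ones.
    """
    return RUNS_PER_MONTH * sample_rate * judge_cost_per_run

if __name__ == "__main__":
    print(f"Per-run ceiling: ${per_run_ceiling:.3f}")
    # Example: scoring 5% of runs with a hypothetical $0.01-per-run judge pass.
    print(f"Async scoring at 5%: ${monitoring_cost(0.05, 0.01):,.0f}/month")
```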
Available Resources
- Workflow stages: email classification, structured extraction, retrieval over 80K policy docs, recommendation generation, optional tool calls to CRM and ticketing APIs
- Historical data: 1.2M prior workflow runs, 8K manually audited outcomes, analyst overrides, downstream resolution status, and user complaints
- Approved models: GPT-4.1-mini / GPT-4.1, an embeddings model, and a lightweight moderation model
- Existing observability stack: OpenTelemetry traces, metrics backend, alerting via PagerDuty and Slack (a stage-level instrumentation sketch follows this list)
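Since OpenTelemetry tracing is already in place, stage-level monitoring can hang off per-stage spans. A minimal sketch under that assumption follows, using the OpenTelemetry Python API with hypothetical `finflow.*` attribute names and a stubbed extraction stage; only hashed identifiers and derived counts are attached to the span, never raw email text, which keeps traces consistent with the no-raw-PII constraint.

```python
import hashlib
from opentelemetry import trace

tracer = trace.get_tracer("finflow.workflow")

def hashed_id(raw: str, salt: str = "finflow-v1") -> str:
    """Stable, non-reversible identifier so spans never carry raw PII."""
    return hashlib.sha256((salt + raw).encode()).hexdigest()[:16]

def extract_fields(email_body: str) -> dict:
    """Stub standing in for the real structured-extraction stage."""
    return {"invoice_id": None, "amount": None, "missing_required": 2}

def run_extraction_stage(email_body: str, vendor_email: str) -> dict:
    # One span per workflow stage; the same pattern applies to classification,
    # retrieval, recommendation generation, and tool-call stages.
    with tracer.start_as_current_span("extraction") as span:
        span.set_attribute("finflow.stage", "extraction")                      # hypothetical attribute names
        span.set_attribute("finflow.vendor_id_hash", hashed_id(vendor_email))  # hashed, never raw
        fields = extract_fields(email_body)
        span.set_attribute("finflow.fields_extracted", len(fields))
        span.set_attribute("finflow.missing_required", fields["missing_required"])
        return fields
```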
Task
- Design a monitoring and alerting strategy for this workflow, including stage-level, end-to-end, and business-impact metrics.
- Define an eval-first framework: offline audit sets, adversarial tests, and online guardrails that detect subtle failures before and after launch.
- Propose how to detect hallucination, prompt injection, extraction drift, retrieval regressions, and unsafe tool behavior in production (an extraction-drift example is sketched after this list).
- Specify alert thresholds, dashboards, triage playbooks, and rollback criteria for prompt/model/retrieval changes.
- Explain the cost/latency tradeoffs of your monitoring design, including what to sample, what to score asynchronously, and what must block the workflow synchronously (one possible synchronous/asynchronous split is sketched after this list).
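As one concrete example of the extraction-drift detection asked for above, the sketch below compares the value distribution of a single extracted field between a historical baseline window and the current window using a population stability index. The field name, window contents, and the 0.2 alert threshold are illustrative assumptions rather than calibrated values.

```python
import math
from collections import Counter

def psi(baseline: list[str], current: list[str], eps: float = 1e-6) -> float:
    """Population stability index between two categorical value distributions."""
    categories = set(baseline) | set(current)
    b_counts, c_counts = Counter(baseline), Counter(current)
    score = 0.0
    for cat in categories:
        b = max(b_counts[cat] / len(baseline), eps)
        c = max(c_counts[cat] / len(current), eps)
        score += (c - b) * math.log(c / b)
    return score

# Illustrative: values of a hypothetical extracted "payment_terms" field per window.
baseline_window = ["net30"] * 700 + ["net60"] * 250 + ["unknown"] * 50
current_window = ["net30"] * 500 + ["net60"] * 300 + ["unknown"] * 200

if psi(baseline_window, current_window) > 0.2:  # 0.2 is a common rule of thumb, not a tuned value
    print("Extraction drift alert: route to triage playbook")
```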
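On the last task item, one common split is sketched below: a cheap deterministic check (here, a tool-call allowlist) blocks the run synchronously, while a hash-based sampler routes a small, deterministic fraction of runs to an asynchronous LLM-judge queue so scoring cost and latency stay off the critical path. The tool names, sample rate, and `enqueue` hook are placeholders, not FinFlow's real interfaces.

```python
import hashlib

ALLOWED_TOOLS = {"crm.lookup_vendor", "ticketing.create_ticket"}  # assumed tool names
ASYNC_SAMPLE_RATE = 0.05  # fraction of runs scored offline by an LLM judge

def check_tool_call_sync(tool_name: str) -> None:
    """Synchronous guardrail: unsafe tool calls must block the run inline."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"Tool {tool_name!r} is not on the allowlist")

def sampled_for_async_scoring(run_id: str) -> bool:
    """Deterministic sampling: the same run is always in or out of the sample."""
    digest = hashlib.sha256(run_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < ASYNC_SAMPLE_RATE

def finish_run(run_id: str, output: dict, enqueue) -> None:
    """After the workflow responds, optionally enqueue the run for offline judging."""
    if sampled_for_async_scoring(run_id):
        enqueue({"run_id": run_id, "output": output})  # judged off the hot path
```

Hashing on the run ID rather than drawing a random number means a replayed run lands in the same sample, which makes before/after comparisons across prompt, model, or retrieval changes cleaner.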