Design Async LLM Job Delivery

Context

BrightInbox is adding an AI writing assistant that generates long-form email replies, account summaries, and follow-up drafts. Some requests take several seconds and may require retrieval or tool calls, so the product team wants an asynchronous pipeline that reliably returns results to users without duplicate jobs, lost outputs, or unsafe content.

Constraints

API acknowledgment to client: p95 < 300 ms
End-to-end job completion: p95 < 12 s, p99 < 30 s
Cost ceiling: <$9 per 1,000 completed jobs
Hallucination rate on grounded tasks: <2% on a labeled offline set
Prompt injection success rate from retrieved content or tool output: <0.5%
At-least-once queue delivery is acceptable, but user-visible duplicate outputs are not
Users must see accurate job state: queued, running, succeeded, failed, expired
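
One way to keep the user-visible states above consistent under at-least-once delivery is to route every status change through an explicit transition table. The sketch below is illustrative only: `JobState`, `ALLOWED`, and `transition` are hypothetical names, and the retry edge from running back to queued is an assumption rather than a stated requirement.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    EXPIRED = "expired"


# Assumed transition table: terminal states have no outgoing edges, so a
# redelivered queue message cannot move a finished job backwards.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.EXPIRED},
    # RUNNING -> QUEUED models a retry after a worker crash (an assumption).
    JobState.RUNNING: {
        JobState.QUEUED,
        JobState.SUCCEEDED,
        JobState.FAILED,
        JobState.EXPIRED,
    },
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
    JobState.EXPIRED: set(),
}


def transition(current, target):
    """Return (new_state, applied); illegal transitions leave the state unchanged."""
    if target in ALLOWED[current]:
        return target, True
    return current, False
```

With this table, a duplicate message that tries to move a succeeded job back to running is ignored, so the state a user sees never regresses.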

Available Resources

2M historical support emails and CRM notes, with document-level ACLs
Existing Postgres, Redis, object storage, and a managed queue
Approved models: GPT-4.1-mini for orchestration, GPT-4.1 for high-risk generations
Optional retrieval over CRM notes and help-center articles
Web app, mobile app, and webhook callback support for result delivery

Task

Design the end-to-end asynchronous architecture: request intake, idempotent job creation, queueing, worker execution, persistence, and result delivery back to users.
Specify how you would structure prompts and outputs so workers can safely produce typed results, citations, status, and refusal reasons when needed.
Define an eval-first plan covering offline quality/safety evaluation and online reliability/product metrics before finalizing architecture.
Explain how you would handle retries, partial failures, duplicate messages, timeouts, cancellation, and replay/backfill without surfacing inconsistent states.
Estimate cost and latency, and describe the main tradeoffs between model quality, throughput, and reliability safeguards.
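
A minimal sketch of idempotent job creation and a commit-once result write, two of the design areas listed above. It uses an in-memory SQLite database purely as a stand-in for the existing Postgres. The `jobs` schema, the client-supplied `idempotency_key`, and the helper names are assumptions for illustration, not part of the problem statement.

```python
import sqlite3
import uuid


def open_db():
    """Create an in-memory database with a hypothetical jobs table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """
        CREATE TABLE jobs (
            job_id TEXT PRIMARY KEY,
            user_id TEXT NOT NULL,
            idempotency_key TEXT NOT NULL,
            state TEXT NOT NULL DEFAULT 'queued',
            result TEXT,
            UNIQUE (user_id, idempotency_key)
        )
        """
    )
    return db


def create_job(db, user_id, idempotency_key):
    """Insert a job once; a retried request with the same key gets the same job_id."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR IGNORE INTO jobs (job_id, user_id, idempotency_key) "
            "VALUES (?, ?, ?)",
            (str(uuid.uuid4()), user_id, idempotency_key),
        )
    row = db.execute(
        "SELECT job_id FROM jobs WHERE user_id = ? AND idempotency_key = ?",
        (user_id, idempotency_key),
    ).fetchone()
    return row[0]


def commit_result(db, job_id, result):
    """Write the result only if the job is not yet terminal.

    Returns True for the first successful commit and False for any later
    attempt, e.g. when the queue redelivers the same message.
    """
    with db:
        cur = db.execute(
            "UPDATE jobs SET state = 'succeeded', result = ? "
            "WHERE job_id = ? AND state IN ('queued', 'running')",
            (result, job_id),
        )
    return cur.rowcount == 1
```

The unique constraint absorbs duplicate intake requests, and the conditional `UPDATE` means only one worker's output can become the user-visible result even when the same message is processed twice.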
