Design Async LLM Job Delivery

Hard

Generative AI & LLMs

RAG, LLM Agents, Structured Extraction

Problem

Context

BrightInbox is adding an AI writing assistant that generates long-form email replies, account summaries, and follow-up drafts. Some requests take several seconds and may require retrieval or tool calls, so the product team wants an asynchronous pipeline that reliably returns results to users without duplicate jobs, lost outputs, or unsafe content.

Constraints

API acknowledgment to client: p95 < 300ms
End-to-end job completion: p95 < 12s, p99 < 30s
Cost ceiling: <$9 per 1,000 completed jobs
Hallucination rate on grounded tasks: <2% on a labeled offline set
Prompt injection success rate from retrieved content or tool output: <0.5%
At-least-once queue delivery is acceptable, but user-visible duplicate outputs are not
Users must see accurate job state: queued, running, succeeded, failed, expired
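One way to read the last two constraints together is as a small state machine: if terminal states accept no further transitions, a redelivered queue message cannot move a finished job back to running. The sketch below is illustrative only. The names `JobState`, `ALLOWED`, and `apply_transition` are hypothetical, and the transition table is an assumption (for example, it omits a running-to-queued retry edge that a real design might want).

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    EXPIRED = "expired"


# Assumed transition table: terminal states have no outgoing edges.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.EXPIRED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.EXPIRED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
    JobState.EXPIRED: set(),
}


def apply_transition(current: JobState, requested: JobState) -> JobState:
    """Return the new state, or the unchanged state if the move is not allowed."""
    if requested in ALLOWED[current]:
        return requested
    return current
```

With this table, a duplicate message that tries to restart a succeeded job is a no-op, so the state shown to the user stays stable under at-least-once delivery.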

Available Resources

2M historical support emails and CRM notes, with document-level ACLs
Existing Postgres, Redis, object storage, and a managed queue
Approved models: GPT-4.1-mini for orchestration, GPT-4.1 for high-risk generations
Optional retrieval over CRM notes and help-center articles
Web app, mobile app, and webhook callback support for result delivery

Task

Design the end-to-end asynchronous architecture: request intake, idempotent job creation, queueing, worker execution, persistence, and result delivery back to users.
Specify how you would structure prompts and outputs so workers can safely produce typed results, citations, status, and refusal reasons when needed.
Define an eval-first plan covering offline quality/safety evaluation and online reliability/product metrics before finalizing architecture.
Explain how you would handle retries, partial failures, duplicate messages, timeouts, cancellation, and replay/backfill without surfacing inconsistent states.
Estimate cost and latency, and describe the main tradeoffs between model quality, throughput, and reliability safeguards.
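As a starting point for the idempotent job creation named in the first task item, the sketch below shows the core idea: the client supplies an idempotency key, and a repeated request with the same key returns the original job instead of creating a second one. This is a minimal in-memory illustration, not a reference answer. `JobStore` and `create_job` are hypothetical names, and the dictionary stands in for what would more plausibly be a unique constraint on (user, key) in the existing Postgres.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class JobStore:
    # Maps (user_id, idempotency_key) -> job_id. In-memory stand-in for a
    # database table with a unique constraint on that pair (an assumption).
    _by_key: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def create_job(self, user_id: str, idempotency_key: str) -> Tuple[str, bool]:
        """Return (job_id, created). created is False when the key was seen before."""
        key = (user_id, idempotency_key)
        existing = self._by_key.get(key)
        if existing is not None:
            return existing, False
        job_id = str(uuid.uuid4())
        self._by_key[key] = job_id
        return job_id, True
```

A client retry after a dropped acknowledgment would call `create_job` again with the same key and receive the same `job_id`, so only one job is ever enqueued for that request.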
