Context
BrightInbox is adding an AI writing assistant that generates long-form email replies, account summaries, and follow-up drafts. Some requests take several seconds and may require retrieval or tool calls, so the product team wants an asynchronous pipeline that reliably returns results to users without duplicate jobs, lost outputs, or unsafe content.
Constraints
- API acknowledgment to client: p95 < 300 ms
- End-to-end job completion: p95 < 12 s, p99 < 30 s
- Cost ceiling: <$9 per 1,000 completed jobs
- Hallucination rate on grounded tasks: <2% on a labeled offline set
- Prompt-injection success rate via retrieved content or tool output: <0.5%
- At-least-once queue delivery is acceptable, but user-visible duplicate outputs are not
- Users must see accurate job state: queued, running, succeeded, failed, expired
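The last two constraints interact: with at-least-once delivery, a worker may receive the same message twice or receive a stale message after a job has already finished. One way to keep the user-visible state accurate is to treat the five states as a small state machine with absorbing terminal states. The sketch below is illustrative only. The transition table, the `apply_transition` helper, and the choice to silently ignore late messages are assumptions, not requirements stated above; for example, a design that re-queues running jobs on retry would need an extra `running -> queued` edge.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    EXPIRED = "expired"


# Assumed transition table (hypothetical): terminal states have no outgoing edges.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING, JobState.FAILED, JobState.EXPIRED},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.EXPIRED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
    JobState.EXPIRED: set(),
}


def apply_transition(current: JobState, requested: JobState) -> JobState:
    """Return the state to persist after a (possibly duplicated) message."""
    if requested == current:
        # Redelivery of a message that was already applied: no-op.
        return current
    if requested in ALLOWED[current]:
        return requested
    if not ALLOWED[current]:
        # Job is already terminal: ignore late or duplicate messages
        # so the user never sees a finished job change state.
        return current
    # Non-terminal job asked to skip a step (e.g. queued -> succeeded).
    raise ValueError(f"illegal transition {current.value} -> {requested.value}")
```

In a real deployment this check would presumably run inside the same database transaction that writes the new state, so that two workers racing on the same job cannot both win.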
Available Resources
- 2M historical support emails and CRM notes, with document-level ACLs
- Existing Postgres, Redis, object storage, and a managed queue
- Approved models: GPT-4.1-mini for orchestration, GPT-4.1 for high-risk generations
- Optional retrieval over CRM notes and help-center articles
- Web app, mobile app, and webhook callback support for result delivery
Task
- Design the end-to-end asynchronous architecture: request intake, idempotent job creation, queueing, worker execution, persistence, and result delivery back to users.
- Specify how you would structure prompts and outputs so workers can safely produce typed results, citations, status, and refusal reasons when needed.
- Define an eval-first plan covering offline quality/safety evaluation and online reliability/product metrics, to be completed before the architecture is finalized.
- Explain how you would handle retries, partial failures, duplicate messages, timeouts, cancellation, and replay/backfill without surfacing inconsistent states.
- Estimate cost and latency, and describe the main tradeoffs between model quality, throughput, and reliability safeguards.
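As a starting point for the idempotent job creation item above, here is a minimal sketch of one common approach: derive an idempotency key from the caller's identity plus a client-supplied request ID, and let a unique constraint reject duplicates. Everything named here is hypothetical (`make_idempotency_key`, `create_job`, the `jobs` table layout, the `client_request_id` field). The sketch uses an in-memory SQLite database so it runs standalone; against the existing Postgres the equivalent statement would likely be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import hashlib
import sqlite3
import uuid
from typing import Tuple


def make_idempotency_key(user_id: str, client_request_id: str) -> str:
    # Scope the key to the user so two users cannot collide on the same request ID.
    raw = f"{user_id}:{client_request_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def open_store() -> sqlite3.Connection:
    # In-memory stand-in for the jobs table (assumed schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " job_id TEXT PRIMARY KEY,"
        " idempotency_key TEXT NOT NULL UNIQUE,"
        " state TEXT NOT NULL)"
    )
    return conn


def create_job(
    conn: sqlite3.Connection, user_id: str, client_request_id: str
) -> Tuple[str, bool]:
    """Return (job_id, created). A repeated request returns the original job_id."""
    key = make_idempotency_key(user_id, client_request_id)
    candidate_id = str(uuid.uuid4())
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO jobs (job_id, idempotency_key, state) "
            "VALUES (?, ?, 'queued')",
            (candidate_id, key),
        )
        created = cur.rowcount == 1
        row = conn.execute(
            "SELECT job_id FROM jobs WHERE idempotency_key = ?", (key,)
        ).fetchone()
    return row[0], created
```

A caller would enqueue a message only when `created` is true and otherwise return the existing job's ID and status, which keeps client retries from producing a second job while still allowing a fast acknowledgment.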