Context
BrightDesk sells an LLM-powered customer-support drafting API to enterprise teams. A new customer wants the model to produce highly consistent outputs across agents, channels, and repeated runs, without building a full RAG system yet.
Constraints
- p95 latency: at most 1,200 ms per request
- Cost ceiling: $8 per 1,000 requests
- Output consistency target: at least 90% schema-valid responses on a 300-prompt golden set
- Hallucination ceiling: fewer than 2% unsupported policy claims in offline evaluation
- Safety: must resist prompt injection attempts in user input and avoid leaking hidden instructions
- The customer may later localize prompts into three languages, so the prompt design should remain maintainable and easy to translate
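The 90% schema-validity target above can be measured mechanically against the golden set. A minimal sketch, assuming model responses are raw JSON strings and a hypothetical schema of a few required typed fields (field names are illustrative, not from the brief):

```python
import json

# Hypothetical required fields for a drafted support reply.
REQUIRED_FIELDS = {"reply_text": str, "tone": str, "needs_escalation": bool}

def is_schema_valid(raw: str) -> bool:
    """Return True if raw parses as a JSON object with all required typed fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    return all(
        key in obj and isinstance(obj[key], typ)
        for key, typ in REQUIRED_FIELDS.items()
    )

def schema_valid_rate(responses: list[str]) -> float:
    """Fraction of responses passing validation (target: >= 0.90 on 300 prompts)."""
    return sum(is_schema_valid(r) for r in responses) / len(responses)
```

Running this over the 300-prompt golden set before each prompt-version rollout gives a direct pass/fail gate on the consistency constraint.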
Available Resources
- Historical dataset of 20,000 support prompts and agent-written ideal responses
- A 300-example golden set labeled for format adherence, factuality, tone, and refusal correctness
- Two approved hosted models: a lower-cost fast model and a higher-quality mid-tier model
- Existing API gateway that can enforce JSON schema validation and retries
- No retrieval layer for this phase; answers must rely only on provided business rules and user input
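The gateway behavior listed above (enforce JSON schema validation, then retry) can be sketched as a small wrapper. A minimal illustration, assuming hypothetical `call_model` and `is_valid` callables; on exhaustion the caller would fall back, e.g. to the mid-tier model or a human agent:

```python
from typing import Callable, Optional

def draft_with_retries(
    call_model: Callable[[str], str],   # hypothetical model-call function
    is_valid: Callable[[str], bool],    # schema validator (e.g. JSON checks)
    prompt: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call the model, retrying until the output validates or attempts run out."""
    for _ in range(max_attempts):
        raw = call_model(prompt)
        if raw is not None and is_valid(raw):
            return raw
    return None  # signal fallback: escalate to the stronger model or a human
```

Each retry adds latency, so the attempt budget has to fit inside the 1,200 ms p95 target.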
Task
- Design a prompt engineering approach that maximizes output consistency, including prompt structure, delimiters, examples, and structured output requirements.
- Define an evaluation plan before rollout, including offline tests for consistency, hallucination, and prompt injection resistance, plus online monitoring after launch.
- Propose the serving architecture, including model choice, fallback/retry behavior, schema validation, and versioning strategy for prompts.
- Estimate cost and latency at 200,000 requests per month, and explain what tradeoffs you would make if the team must reduce cost by 40%.
- Identify the main failure modes for inconsistent outputs and explain how you would detect and mitigate them in production.
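The cost arithmetic in the estimation task follows directly from the stated numbers; a quick check of the ceiling at 200,000 requests per month, before and after the 40% cut:

```python
requests_per_month = 200_000
cost_ceiling_per_1k = 8.00  # dollars per 1,000 requests, from the constraints

monthly_ceiling = requests_per_month / 1_000 * cost_ceiling_per_1k
reduced_ceiling = monthly_ceiling * 0.60  # after a 40% cost reduction
reduced_per_1k = reduced_ceiling / (requests_per_month / 1_000)

print(monthly_ceiling)  # 1600.0
print(reduced_ceiling)  # 960.0
print(reduced_per_1k)   # 4.8, i.e. an effective $4.80 per 1,000 requests
```

So a 40% cut turns the $8 per 1,000 ceiling into an effective $4.80, which frames the tradeoff discussion (cheaper model, shorter prompts, fewer retries).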