Context
Jasper's customer-success team is hearing a recurring complaint from mid-market customers: the same prompt often produces marketing copy of noticeably different quality from one run to the next. You own the LLM quality workstream for Jasper's brand-content assistant, which generates product descriptions, ad copy, and email drafts from a customer's brand voice settings and campaign brief.
Constraints
- p95 latency: 2,500ms per generation request
- Cost ceiling: $0.03 per request and $40K/month at current volume
- Quality target: reduce "inconsistent output" complaints by 30% within one quarter
- Hallucination ceiling: factual-error rate below 2% on a 300-prompt golden set
- Safety: must not leak customer brand guidelines across tenants; must resist prompt-injection attempts such as "ignore brand rules" or "write in a competitor's voice"
- Product requirement: users should still be able to request creative variation intentionally
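
The numeric constraints above imply two derived limits that are worth having on hand. The sketch below works them out with exact arithmetic. It assumes the worst case in which every request costs the full $0.03, and it reads the hallucination ceiling as a strict "less than 2%" of the 300 golden-set prompts. The function and constant names are illustrative, and the request count is an upper bound implied by the two cost ceilings, not the actual current volume (which the brief does not state).

```python
from fractions import Fraction

# Values taken from the Constraints list; names are illustrative.
PER_REQUEST_CEILING_USD = Fraction(3, 100)   # $0.03 per request
MONTHLY_CEILING_USD = Fraction(40_000)       # $40K per month
GOLDEN_SET_SIZE = 300
HALLUCINATION_CEILING = Fraction(2, 100)     # strictly below 2%


def max_requests_at_ceiling(monthly: Fraction, per_request: Fraction) -> int:
    """Largest whole number of requests that fits in the monthly budget
    if every request costs the full per-request ceiling (assumption)."""
    return int(monthly / per_request)  # truncates the fractional request


def max_allowed_errors(set_size: int, ceiling: Fraction) -> int:
    """Largest error count whose rate is strictly below the ceiling."""
    limit = set_size * ceiling
    # If the limit is a whole number, hitting it exactly would equal the
    # ceiling rather than stay below it, so step down by one.
    return int(limit) - 1 if limit.denominator == 1 else int(limit)


if __name__ == "__main__":
    print(max_requests_at_ceiling(MONTHLY_CEILING_USD, PER_REQUEST_CEILING_USD))
    print(max_allowed_errors(GOLDEN_SET_SIZE, HALLUCINATION_CEILING))
```

Under these assumptions the budget covers at most 1,333,333 requests per month at the per-request ceiling, and the golden set tolerates at most 5 factually wrong outputs (6 of 300 would be exactly 2%).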
Available Resources
- Historical logs: prompt, model settings, generated output, user edits, thumbs up/down, regenerate events
- Customer-specific brand guidelines, tone settings, and approved example copy
- Existing prompt templates and a small set of manually curated "good vs bad consistency" examples
- Access to a fast small model and a higher-quality larger model from an approved provider
- Ability to ship prompt changes, retrieval of brand guidelines, and lightweight fine-tuning if justified
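
One way the historical logs listed above could be used as a first diagnostic is to compare regenerate rates across a logged attribute, such as a decoding setting or a tenant. The sketch below is a minimal illustration only: the record layout and the field names `temperature`, `tenant_id`, and `regenerated` are hypothetical, since the brief says only that logs contain model settings and regenerate events, not how they are stored.

```python
from collections import defaultdict
from typing import Any, Hashable, Iterable, Mapping


def regenerate_rate_by(
    records: Iterable[Mapping[str, Any]], key: str
) -> dict[Hashable, float]:
    """Fraction of generations followed by a regenerate event,
    grouped by the value of `key` in each log record."""
    totals: dict[Hashable, int] = defaultdict(int)
    regens: dict[Hashable, int] = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        if record["regenerated"]:  # hypothetical boolean field
            regens[group] += 1
    return {group: regens[group] / totals[group] for group in totals}


if __name__ == "__main__":
    # Made-up records purely to show the call shape, not real data.
    sample_logs = [
        {"tenant_id": "a", "temperature": 0.2, "regenerated": False},
        {"tenant_id": "a", "temperature": 0.9, "regenerated": True},
        {"tenant_id": "b", "temperature": 0.9, "regenerated": False},
    ]
    print(regenerate_rate_by(sample_logs, "temperature"))
    print(regenerate_rate_by(sample_logs, "tenant_id"))
```

A rate that rises with a decoding setting would point toward decoding as a cause, while a rate concentrated in a few tenants would point toward tenant-specific brand ambiguity. Either reading is a hypothesis to test, not a conclusion, because regenerate events can also reflect intentional requests for variation.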
Deliverables
- Define how you would diagnose whether inconsistency is caused by prompt design, decoding settings, missing context, model choice, or tenant-specific brand ambiguity.
- Propose an eval-first solution to improve consistency without making outputs bland, including offline and online metrics.
- Design the prompting and serving architecture, including how brand guidelines and examples are injected and when to use deterministic vs higher-variance generation.
- Explain whether you would use prompt changes only, retrieval, lightweight fine-tuning, or a hybrid approach, and justify the cost/latency trade-offs.
- Identify key failure modes, including hallucination, prompt injection, and cross-tenant leakage, and how you would detect and mitigate them.
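
As a starting point for the offline metric and the deterministic vs higher-variance decision named in the deliverables, the sketch below pairs a simple run-to-run similarity score with a two-mode decoding policy. It rests on several assumptions: the temperature and top-p values are placeholders rather than tuned settings, `DecodingConfig` and `choose_decoding` are hypothetical names, and character-level similarity from the standard library's `difflib` stands in for whatever quality or brand-adherence scorer a real eval would use. Lexical similarity is a crude proxy, since two drafts can differ in wording while being equally on-brand.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class DecodingConfig:
    temperature: float
    top_p: float


# Placeholder values (assumptions), to be tuned against the eval set.
CONSISTENT = DecodingConfig(temperature=0.2, top_p=0.9)
EXPLORATORY = DecodingConfig(temperature=0.9, top_p=0.95)


def choose_decoding(wants_variation: bool) -> DecodingConfig:
    """Low-variance decoding by default; higher variance only when the
    user has explicitly asked for creative variation."""
    return EXPLORATORY if wants_variation else CONSISTENT


def run_to_run_similarity(outputs: list[str]) -> float:
    """Mean pairwise similarity (0.0 to 1.0) across repeated generations
    of the same prompt. Higher means more consistent surface text."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    drafts = [
        "Fresh roasted coffee, delivered weekly.",
        "Fresh roasted coffee, delivered every week.",
        "Wake up to small-batch beans at your door.",
    ]
    print(choose_decoding(wants_variation=False))
    print(round(run_to_run_similarity(drafts), 3))
```

Tracking this score per prompt on the golden set, separately for each decoding mode, would show whether a fix narrows variance in the default mode while leaving the exploratory mode free to vary.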