Context
PulseChat is adding an AI writing assistant to its mobile app. The feature helps users rewrite, summarize, and draft short messages, captions, and replies directly in on-device UI surfaces, while generation itself is served from the cloud.
Constraints
- p95 end-to-end latency: ≤900 ms on mobile networks
- Cost ceiling: $8 per 1,000 assisted generations
- Unsafe or policy-violating output shown to users: <0.5%
- Hallucinated factual claims in assistive rewrites/summaries: <2% on a labeled eval set
- Prompt injection success rate from pasted user content: <1%
- Must degrade gracefully: if confidence is low, return a safer rewrite or refuse
- No raw message logs containing PII may be stored longer than 7 days
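The constraints above can be captured as a machine-checkable budget so that offline evals and production dashboards test against the same numbers. A minimal sketch; the names and structure are illustrative, not part of the brief:

```python
# Hypothetical guardrail budget for the PulseChat assistant.
# Each key maps to an upper bound; observed metrics must stay at or below it.
GUARDRAIL_BUDGET = {
    "p95_latency_ms": 900,
    "cost_per_1k_generations_usd": 8.0,
    "unsafe_output_rate": 0.005,        # <0.5% unsafe output shown to users
    "hallucination_rate": 0.02,         # <2% on the labeled eval set
    "injection_success_rate": 0.01,     # <1% prompt-injection success
    "pii_log_retention_days": 7,
}

def budget_violations(observed: dict) -> list[str]:
    """Return the names of any thresholds the observed metrics exceed."""
    return [
        key for key, limit in GUARDRAIL_BUDGET.items()
        if observed.get(key, 0) > limit
    ]
```

Wiring the same dictionary into both the eval harness and the launch dashboard avoids the budgets silently drifting apart.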
Available Resources
- 2M historical, human-written mobile messages and captions with user consent for model evaluation only
- A policy taxonomy covering self-harm, harassment, sexual content, minors, medical/legal/financial advice, and privacy leaks
- An approved LLM API (OpenAI or Anthropic), plus a smaller moderation/classification model
- Mobile client can send user locale, coarse age band, and feature intent (rewrite, summarize, reply_suggest)
- A red-team set of adversarial prompts, including pasted text that says things like “ignore previous instructions”
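The client signals above suggest a small, strictly validated request shape, which also makes the trust boundary explicit: user text is data, never instructions. A sketch under that assumption; the class and field names are hypothetical:

```python
from dataclasses import dataclass

# Only the three documented feature intents are accepted.
ALLOWED_INTENTS = {"rewrite", "summarize", "reply_suggest"}

@dataclass(frozen=True)
class AssistRequest:
    locale: str      # e.g. "en-US"
    age_band: str    # coarse band only, e.g. "18-24"
    intent: str      # one of ALLOWED_INTENTS
    user_text: str   # untrusted content: never interpreted as instructions

    def __post_init__(self) -> None:
        if self.intent not in ALLOWED_INTENTS:
            raise ValueError(f"unknown intent: {self.intent}")
```

Rejecting unknown intents at the edge keeps the server-side prompt templates closed over a fixed set of behaviors.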
Task
- Design the end-to-end guardrail strategy for AI-generated mobile content, including pre-generation checks, prompt design, post-generation validation, and fallback behavior.
- Define an evaluation-first plan: offline safety and quality benchmarks, calibration, and online guardrail metrics after launch.
- Propose the serving architecture and model routing strategy that meets both latency and cost constraints.
- Write a production-grade system prompt that constrains output style, refusal behavior, and treatment of user-provided text as untrusted data.
- Identify the top failure modes for mobile AI content generation and how you would detect and mitigate them in production.
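The graceful-degradation requirement (low confidence returns a safer rewrite or a refusal) implies a tiered post-generation decision. One way to sketch it, assuming a moderation score and a generator confidence signal are available; all names and thresholds here are hypothetical placeholders to be calibrated on the labeled eval set:

```python
def respond(draft: str, safety_score: float, confidence: float) -> str:
    """Choose a response tier from post-generation checks.

    safety_score: estimated probability the draft violates policy
                  (from the smaller moderation/classification model)
    confidence:   generator's self-assessed confidence in the rewrite
    """
    if safety_score > 0.5:
        return "REFUSE"        # clear policy risk: refuse outright
    if confidence < 0.6 or safety_score > 0.1:
        return "SAFE_REWRITE"  # degrade to a conservative template rewrite
    return draft               # serve the model output as-is
```

The two cutoffs would be set by calibrating against the offline safety benchmarks so that the <0.5% unsafe-output constraint holds at the chosen operating point.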