Context
PulseNote is building an AI writing assistant inside a mobile health-journaling app. The feature rewrites user notes, summarizes recent entries, and answers simple questions about past journal content. You need to decide whether inference should run primarily on-device, server-side, or via a hybrid routing strategy.
Constraints
- p95 latency: <700ms for rewrite/summarize actions
- Cost ceiling: <$0.015 per daily active user per month at 2M MAU
- Hallucination ceiling: <2% on a labeled factual-grounding set for history-based answers
- Privacy: raw journal text from minors and EU users must not leave the device unless the user explicitly consents
- Safety: system must resist prompt injection from retrieved journal content and refuse unsupported medical advice
- App size increase from on-device models must stay under 250MB
- Battery drain from on-device inference must remain acceptable on mid-tier Android devices
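To see how tight the cost ceiling is for a pure server-side option, a back-of-the-envelope check helps. All inputs below (DAU/MAU ratio, request volume, token counts, per-token pricing) are illustrative assumptions, not PulseNote data:

```python
# Back-of-the-envelope server-side cost check.
# Every input here is an illustrative assumption, not a measured value.
MAU = 2_000_000
DAU_RATIO = 0.25                  # assumed DAU/MAU ratio
REQS_PER_DAU_PER_DAY = 4          # assumed rewrite/summarize/Q&A actions per day
TOKENS_IN, TOKENS_OUT = 600, 200  # assumed average tokens per request
PRICE_IN = 0.40 / 1_000_000       # assumed $/input token (mini-class model)
PRICE_OUT = 1.60 / 1_000_000      # assumed $/output token

cost_per_req = TOKENS_IN * PRICE_IN + TOKENS_OUT * PRICE_OUT
monthly_cost_per_dau = cost_per_req * REQS_PER_DAU_PER_DAY * 30

print(f"cost/request:          ${cost_per_req:.6f}")
print(f"monthly cost per DAU:  ${monthly_cost_per_dau:.4f}")
# If the assumed volume pushes past the <$0.015/DAU/month ceiling,
# that argues for routing the cheap, frequent tasks on-device.
```

Under these particular assumptions the monthly figure lands well above the ceiling, which is the kind of result that motivates a hybrid split rather than all-server inference.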
Available Resources
- 8M anonymized historical prompts/responses from the current cloud assistant
- 25K labeled evaluation examples across rewrite, summarization, grounded Q&A, refusal, and adversarial injection cases
- Candidate models:
  - On-device 1B and 3B quantized instruction models
  - Server-side GPT-4.1-mini / Claude Sonnet-class models
  - Embedding model for local or server-side retrieval over a user's recent journal entries
- Mobile telemetry for latency, battery, crash rate, and network quality
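The 25K labeled examples lend themselves to an offline gating harness that slices pass rates by task and checks them against the constraints. A minimal sketch, where the record shape, slice names, and the refusal/injection floors are hypothetical placeholders:

```python
from collections import defaultdict

# Hypothetical record shape for the labeled eval set:
# {"task": "grounded_qa", "passed": bool}, where "passed" means the
# candidate output met the label (grounded, correct refusal, etc.).
def slice_pass_rates(results):
    """Aggregate pass rate per task slice (rewrite, summarization, ...)."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["task"]] += 1
        passes[r["task"]] += r["passed"]
    return {t: passes[t] / totals[t] for t in totals}

def gate(rates):
    """Release gate: the <2% hallucination ceiling on grounded Q&A,
    plus assumed floors for refusal and injection resistance."""
    return (rates.get("grounded_qa", 0.0) >= 0.98    # <2% hallucination
            and rates.get("refusal", 0.0) >= 0.99    # assumed floor
            and rates.get("injection", 0.0) >= 0.99) # assumed floor

# Toy records standing in for the 25K examples.
results = [
    {"task": "grounded_qa", "passed": True},
    {"task": "grounded_qa", "passed": True},
    {"task": "refusal", "passed": True},
    {"task": "injection", "passed": False},
]
rates = slice_pass_rates(results)
print(rates, gate(rates))
```

Running the same harness per candidate model (1B, 3B, server-side) produces the offline half of the decision matrix; online metrics from telemetry fill in the other half.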
Task
- Propose an evaluation-first framework to decide between on-device, server-side, and hybrid inference, including offline and online metrics.
- Design the serving architecture, including what tasks run locally vs remotely, fallback behavior, and how privacy policy affects routing.
- Write a system prompt that enforces grounded answers, refusal behavior, and resistance to prompt injection from user-authored journal text.
- Estimate cost and latency for each option and recommend one approach, with explicit tradeoffs.
- Identify key failure modes, monitoring, and rollout safeguards before launching to all users.
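As a concrete starting point for the routing design asked for above, a sketch of a privacy-first decision function. The field names, task labels, and the choice to send grounded Q&A server-side are assumptions to be replaced by real product policy and eval results:

```python
from dataclasses import dataclass

@dataclass
class RequestCtx:
    task: str               # "rewrite" | "summarize" | "grounded_qa" (assumed labels)
    is_minor: bool
    is_eu_user: bool
    has_cloud_consent: bool
    network_ok: bool        # from mobile telemetry
    on_device_ready: bool   # quantized model loaded and healthy

def route(ctx: RequestCtx) -> str:
    """Return 'on_device', 'server', or 'defer' (queue for retry).
    Privacy gates come first: raw journal text from minors and EU
    users stays local absent explicit consent, with no server fallback."""
    privacy_locked = (ctx.is_minor or ctx.is_eu_user) and not ctx.has_cloud_consent
    if privacy_locked:
        return "on_device" if ctx.on_device_ready else "defer"
    if ctx.task == "grounded_qa":
        # Assumption: history-grounded answers need the stronger server
        # model to stay under the 2% hallucination ceiling.
        return "server" if ctx.network_ok else "defer"
    # Rewrite/summarize: prefer local for latency and cost; fall back to server.
    if ctx.on_device_ready:
        return "on_device"
    return "server" if ctx.network_ok else "defer"
```

Usage is a single call per request, e.g. `route(RequestCtx("rewrite", is_minor=True, is_eu_user=False, has_cloud_consent=False, network_ok=True, on_device_ready=True))`; keeping the policy in one pure function makes it easy to unit-test against the constraint list.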