Context
PulseNote is building an AI writing assistant inside a mobile health-journaling app. The feature rewrites user notes, summarizes recent entries, and answers simple questions about past journal content. You need to decide whether inference should run primarily on-device, server-side, or via a hybrid routing strategy.
Constraints
- p95 latency: <700ms for rewrite/summarize actions
- Cost ceiling: <$0.015 per daily active user per month at 2M MAU
- Hallucination ceiling: <2% on a labeled factual-grounding set for history-based answers
- Privacy: raw journal text from minors and EU users must not leave the device unless the user explicitly consents
- Safety: system must resist prompt injection from retrieved journal content and refuse unsupported medical advice
- App size increase from on-device models must stay under 250MB
- Battery drain from on-device inference must remain acceptable on mid-tier Android devices
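To see how tight the cost ceiling is for a pure server-side option, a back-of-the-envelope check helps. All inputs below (DAU/MAU ratio, request volume, token counts, per-token pricing) are illustrative assumptions, not PulseNote data:

```python
# Back-of-the-envelope server-side cost check.
# Every input here is an illustrative assumption, not a measured value.
MAU = 2_000_000
DAU_RATIO = 0.25                  # assumed DAU/MAU ratio
REQS_PER_DAU_PER_DAY = 4          # assumed rewrite/summarize/Q&A actions per day
TOKENS_IN, TOKENS_OUT = 600, 200  # assumed average tokens per request
PRICE_IN = 0.40 / 1_000_000       # assumed $/input token (mini-class model)
PRICE_OUT = 1.60 / 1_000_000      # assumed $/output token

cost_per_req = TOKENS_IN * PRICE_IN + TOKENS_OUT * PRICE_OUT
monthly_cost_per_dau = cost_per_req * REQS_PER_DAU_PER_DAY * 30

print(f"cost/request:          ${cost_per_req:.6f}")
print(f"monthly cost per DAU:  ${monthly_cost_per_dau:.4f}")
# If the assumed volume pushes past the <$0.015/DAU/month ceiling,
# that argues for routing the cheap, frequent tasks on-device.
```

Under these particular assumptions the monthly figure lands well above the ceiling, which is the kind of result that motivates a hybrid split rather than all-server inference.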
Available Resources
- 8M anonymized historical prompts/responses from the current cloud assistant
- 25K labeled evaluation examples across rewrite, summarization, grounded Q&A, refusal, and adversarial injection cases
- Candidate models:
  - On-device 1B and 3B quantized instruction models
  - Server-side GPT-4.1-mini / Claude Sonnet-class models
  - Embedding model for local or server-side retrieval over a user's recent journal entries
- Mobile telemetry for latency, battery, crash rate, and network quality
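The 25K labeled examples lend themselves to an offline gating harness that slices pass rates by task and checks them against the constraints. A minimal sketch, where the record shape, slice names, and the refusal/injection floors are hypothetical placeholders:

```python
from collections import defaultdict

# Hypothetical record shape for the labeled eval set:
# {"task": "grounded_qa", "passed": bool}, where "passed" means the
# candidate output met the label (grounded, correct refusal, etc.).
def slice_pass_rates(results):
    """Aggregate pass rate per task slice (rewrite, summarization, ...)."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["task"]] += 1
        passes[r["task"]] += r["passed"]
    return {t: passes[t] / totals[t] for t in totals}

def gate(rates):
    """Release gate: the <2% hallucination ceiling on grounded Q&A,
    plus assumed floors for refusal and injection resistance."""
    return (rates.get("grounded_qa", 0.0) >= 0.98    # <2% hallucination
            and rates.get("refusal", 0.0) >= 0.99    # assumed floor
            and rates.get("injection", 0.0) >= 0.99) # assumed floor

# Toy records standing in for the 25K examples.
results = [
    {"task": "grounded_qa", "passed": True},
    {"task": "grounded_qa", "passed": True},
    {"task": "refusal", "passed": True},
    {"task": "injection", "passed": False},
]
rates = slice_pass_rates(results)
print(rates, gate(rates))
```

Running the same harness per candidate model (1B, 3B, server-side) produces the offline half of the decision matrix; online metrics from telemetry fill in the other half.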
Task
- Propose an evaluation-first framework to decide between on-device, server-side, and hybrid inference, including offline and online metrics.
- Design the serving architecture, including what tasks run locally vs remotely, fallback behavior, and how privacy policy affects routing.
- Write a system prompt that enforces grounded answers, refusal behavior, and resistance to prompt injection from user-authored journal text.
- Estimate cost and latency for each option and recommend one approach, with explicit tradeoffs.
- Identify key failure modes, monitoring, and rollout safeguards before launching to all users.
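As a concrete starting point for the routing design asked for above, a sketch of a privacy-first decision function. The field names, task labels, and the choice to send grounded Q&A server-side are assumptions to be replaced by real product policy and eval results:

```python
from dataclasses import dataclass

@dataclass
class RequestCtx:
    task: str               # "rewrite" | "summarize" | "grounded_qa" (assumed labels)
    is_minor: bool
    is_eu_user: bool
    has_cloud_consent: bool
    network_ok: bool        # from mobile telemetry
    on_device_ready: bool   # quantized model loaded and healthy

def route(ctx: RequestCtx) -> str:
    """Return 'on_device', 'server', or 'defer' (queue for retry).
    Privacy gates come first: raw journal text from minors and EU
    users stays local absent explicit consent, with no server fallback."""
    privacy_locked = (ctx.is_minor or ctx.is_eu_user) and not ctx.has_cloud_consent
    if privacy_locked:
        return "on_device" if ctx.on_device_ready else "defer"
    if ctx.task == "grounded_qa":
        # Assumption: history-grounded answers need the stronger server
        # model to stay under the 2% hallucination ceiling.
        return "server" if ctx.network_ok else "defer"
    # Rewrite/summarize: prefer local for latency and cost; fall back to server.
    if ctx.on_device_ready:
        return "on_device"
    return "server" if ctx.network_ok else "defer"
```

Usage is a single call per request, e.g. `route(RequestCtx("rewrite", is_minor=True, is_eu_user=False, has_cloud_consent=False, network_ok=True, on_device_ready=True))`; keeping the policy in one pure function makes it easy to unit-test against the constraint list.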