Context
FitSnap is adding an AI-powered mobile feature that lets users ask free-form questions about a workout photo and receive coaching tips, safety warnings, and a short structured summary. You need to decide where inference should run: fully on-device, at an edge location close to the user, or in a central server environment.
Constraints
- p95 end-to-end latency: < 900ms for a short query + image metadata path
- Cost ceiling: < $0.015 per request at 8M requests/month
- Hallucination ceiling: < 2% on a labeled safety-focused golden set
- Privacy: raw images from minors and health-related text should not leave approved regions; some users opt out of cloud processing entirely
- Reliability: feature should degrade gracefully offline or on poor networks
- Safety: must resist prompt injection in user-provided text or OCR-extracted image text, and must refuse unsafe medical advice
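The cost and latency ceilings above can be sanity-checked with back-of-envelope arithmetic before any design work. A minimal sketch: the monthly budget follows directly from the stated numbers, while the per-hop latency split is an illustrative assumption (not given in the constraints) showing one way the 900 ms ceiling could be allocated on an edge path.

```python
# Budget check for the stated constraints.
MONTHLY_REQUESTS = 8_000_000
COST_CEILING_PER_REQ = 0.015  # USD, from the cost constraint

monthly_budget = MONTHLY_REQUESTS * COST_CEILING_PER_REQ
print(f"Monthly inference budget: ${monthly_budget:,.0f}")  # → $120,000

# Hypothetical p95 latency allocation (ms) for an edge-routed request;
# only the 900 ms total comes from the constraints, the split does not.
latency_budget_ms = {
    "device_preprocess": 100,  # capture, resize, metadata extraction
    "uplink": 120,             # mobile network to regional POP
    "edge_inference": 450,     # medium model forward pass
    "policy_checks": 80,       # deterministic safety filters
    "downlink": 120,           # response back to device
}
total = sum(latency_budget_ms.values())
assert total <= 900, f"allocation exceeds ceiling: {total} ms"
print(f"p95 budget allocated: {total} ms of 900 ms")
```

Any placement option that cannot fit its network hops plus inference inside an allocation like this can be ruled out early.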
Available Resources
- A mobile app on recent iOS/Android devices with access to NPUs on ~55% of active devices
- Regional edge POPs in North America and Europe
- Central server inference with access to larger multimodal models and policy filters
- Historical logs: 200K anonymized queries, 5K human-reviewed outcomes, and a 1K-example adversarial set for jailbreaks / prompt injection
- You may use a small on-device model, a medium edge model, and a larger server model, plus deterministic policy checks
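The three model tiers plus deterministic checks suggest a consent- and connectivity-gated routing layer. A minimal sketch, where all names (`Tier`, `RequestContext`, `route`) are hypothetical and the gating rules mirror the stated privacy and reliability constraints:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    ON_DEVICE = "on_device"  # small model; works offline
    EDGE = "edge"            # medium model at a regional POP
    SERVER = "server"        # larger multimodal model + policy filters

@dataclass
class RequestContext:
    cloud_opt_out: bool       # user declined cloud processing entirely
    is_minor: bool            # raw images must stay in approved regions
    online: bool              # network availability
    in_approved_region: bool  # covered by an NA/EU edge POP

def route(ctx: RequestContext) -> Tier:
    # Consent and connectivity are hard gates, not preferences.
    if ctx.cloud_opt_out or not ctx.online:
        return Tier.ON_DEVICE        # graceful-degradation path
    if ctx.is_minor and not ctx.in_approved_region:
        return Tier.ON_DEVICE        # keep raw images local
    if ctx.in_approved_region:
        return Tier.EDGE             # easiest path to the latency ceiling
    return Tier.SERVER               # full capability, more network hops
```

The point of the sketch is the ordering: privacy and consent checks run first and are deterministic, so no model ever sees data it should not route on.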
Task
- Propose an evaluation-first plan to decide among on-device, edge, server, or hybrid routing strategies before committing to an architecture.
- Design the inference placement strategy, including what runs where, fallback behavior, and how user privacy / consent affects routing.
- Specify the prompting and output contract needed to keep responses concise, safe, and structured.
- Estimate latency and cost for your preferred design at target volume, including the impact of caching, model size, and network hops.
- Identify the main failure modes, especially hallucination, prompt injection, privacy leakage, and regional compliance violations.