Context
FitSnap is adding an AI-powered mobile feature that lets users ask free-form questions about a workout photo and receive coaching tips, safety warnings, and a short structured summary. You need to decide where inference should run: fully on-device, at an edge location close to the user, or in a central server environment.
Constraints
- p95 end-to-end latency: < 900ms for a short query + image metadata path
- Cost ceiling: < $0.015 per request at 8M requests/month
- Hallucination ceiling: < 2% on a labeled safety-focused golden set
- Privacy: raw images from minors and health-related text should not leave approved regions; some users opt out of cloud processing entirely
- Reliability: feature should degrade gracefully offline or on poor networks
- Safety: must resist prompt injection in user-provided text or OCR-extracted image text, and must refuse unsafe medical advice
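The cost and latency ceilings above can be sanity-checked with back-of-envelope arithmetic before any design work. A minimal sketch: the monthly budget follows directly from the stated numbers, while the per-hop latency split is an illustrative assumption (not given in the constraints) showing one way the 900 ms ceiling could be allocated on an edge path.

```python
# Budget check for the stated constraints.
MONTHLY_REQUESTS = 8_000_000
COST_CEILING_PER_REQ = 0.015  # USD, from the cost constraint

monthly_budget = MONTHLY_REQUESTS * COST_CEILING_PER_REQ
print(f"Monthly inference budget: ${monthly_budget:,.0f}")  # → $120,000

# Hypothetical p95 latency allocation (ms) for an edge-routed request;
# only the 900 ms total comes from the constraints, the split does not.
latency_budget_ms = {
    "device_preprocess": 100,  # capture, resize, metadata extraction
    "uplink": 120,             # mobile network to regional POP
    "edge_inference": 450,     # medium model forward pass
    "policy_checks": 80,       # deterministic safety filters
    "downlink": 120,           # response back to device
}
total = sum(latency_budget_ms.values())
assert total <= 900, f"allocation exceeds ceiling: {total} ms"
print(f"p95 budget allocated: {total} ms of 900 ms")
```

Any placement option that cannot fit its network hops plus inference inside an allocation like this can be ruled out early.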
Available Resources
- A mobile app on recent iOS/Android devices with access to NPUs on ~55% of active devices
- Regional edge POPs in North America and Europe
- Central server inference with access to larger multimodal models and policy filters
- Historical logs: 200K anonymized queries, 5K human-reviewed outcomes, and a 1K-example adversarial set for jailbreaks / prompt injection
- You may use a small on-device model, a medium edge model, and a larger server model, plus deterministic policy checks
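The three model tiers plus deterministic checks suggest a consent- and connectivity-gated routing layer. A minimal sketch, where all names (`Tier`, `RequestContext`, `route`) are hypothetical and the gating rules mirror the stated privacy and reliability constraints:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    ON_DEVICE = "on_device"  # small model; works offline
    EDGE = "edge"            # medium model at a regional POP
    SERVER = "server"        # larger multimodal model + policy filters

@dataclass
class RequestContext:
    cloud_opt_out: bool       # user declined cloud processing entirely
    is_minor: bool            # raw images must stay in approved regions
    online: bool              # network availability
    in_approved_region: bool  # covered by an NA/EU edge POP

def route(ctx: RequestContext) -> Tier:
    # Consent and connectivity are hard gates, not preferences.
    if ctx.cloud_opt_out or not ctx.online:
        return Tier.ON_DEVICE        # graceful-degradation path
    if ctx.is_minor and not ctx.in_approved_region:
        return Tier.ON_DEVICE        # keep raw images local
    if ctx.in_approved_region:
        return Tier.EDGE             # easiest path to the latency ceiling
    return Tier.SERVER               # full capability, more network hops
```

The point of the sketch is the ordering: privacy and consent checks run first and are deterministic, so no model ever sees data it should not route on.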
Task
- Propose an evaluation-first plan to decide among on-device, edge, server, or hybrid routing strategies before committing to an architecture.
- Design the inference placement strategy, including what runs where, fallback behavior, and how user privacy / consent affects routing.
- Specify the prompting and output contract needed to keep responses concise, safe, and structured.
- Estimate latency and cost for your preferred design at target volume, including the impact of caching, model size, and network hops.
- Identify the main failure modes, especially hallucination, prompt injection, privacy leakage, and regional compliance violations.