Choose Mobile Inference for Categorization

Context

ShopSnap is adding a smart categorization feature to its mobile app. When a user saves a note, receipt, photo caption, or short text snippet, the app should assign one of 120 product-defined categories and return a confidence score plus a short rationale for debugging.

Constraints

p95 end-to-end latency: <300ms on modern devices, <800ms on low-end devices or poor networks
Monthly inference budget: <$25K at 8M categorizations/month
Quality bar: macro-F1 0.88 on English,  0.80 on top 5 non-English locales
Hallucination ceiling: rationale must not invent unsupported attributes in >1% of sampled outputs
Privacy: raw user text from minors and enterprise accounts cannot leave device without explicit consent
Reliability: feature should degrade gracefully offline and under high server load

Available Resources

1.8M historical labeled examples across the 120 categories, with class imbalance
A candidate on-device model budget of <250MB and <2.5W peak power draw
Access to a hosted LLM API for server inference and a smaller distilled local model for on-device inference
Mobile telemetry, feature flags, and an existing evaluation pipeline for offline classification benchmarks
Optional metadata at inference time: locale, app surface, and whether OCR was used

Task

Propose how you would decide between on-device, server-side, or a hybrid routing approach for categorization, explicitly defining the evaluation plan before the architecture.
Design the prompting / output contract so the model returns a valid category, calibrated confidence, and a rationale without leaking unsupported claims.
Describe the serving architecture, including fallback behavior for offline mode, poor connectivity, and low-confidence predictions.
Specify offline and online evaluation, including quality, hallucination, latency, cost, battery, and safety metrics.
Identify key failure modes such as prompt injection in user text, privacy leakage, and long-tail category errors, and explain mitigations.

Constraints

p95 end-to-end latency: <300ms on modern devices, <800ms on low-end devices or poor networks

Monthly inference budget: <$25K at 8M categorizations/month

Quality bar: macro-F1 0.88 on English,  0.80 on top 5 non-English locales

Hallucination ceiling: rationale must not invent unsupported attributes in >1% of sampled outputs

Privacy: raw user text from minors and enterprise accounts cannot leave device without explicit consent

Reliability: feature should degrade gracefully offline and under high server load

Available Resources

1.8M historical labeled examples across the 120 categories, with class imbalance

A candidate on-device model budget of <250MB and <2.5W peak power draw

Access to a hosted LLM API for server inference and a smaller distilled local model for on-device inference

Mobile telemetry, feature flags, and an existing evaluation pipeline for offline classification benchmarks

Optional metadata at inference time: locale, app surface, and whether OCR was used

Task

Propose how you would decide between on-device, server-side, or a hybrid routing approach for categorization, explicitly defining the evaluation plan before the architecture.

Design the prompting / output contract so the model returns a valid category, calibrated confidence, and a rationale without leaking unsupported claims.

Describe the serving architecture, including fallback behavior for offline mode, poor connectivity, and low-confidence predictions.

Specify offline and online evaluation, including quality, hallucination, latency, cost, battery, and safety metrics.

Identify key failure modes such as prompt injection in user text, privacy leakage, and long-tail category errors, and explain mitigations.

Constraints

p95 end-to-end latency: <300ms on modern devices, <800ms on low-end devices or poor networks

Monthly inference budget: <$25K at 8M categorizations/month

Quality bar: macro-F1 0.88 on English,  0.80 on top 5 non-English locales

Hallucination ceiling: rationale must not invent unsupported attributes in >1% of sampled outputs

Privacy: raw user text from minors and enterprise accounts cannot leave device without explicit consent

Reliability: feature should degrade gracefully offline and under high server load

Available Resources

1.8M historical labeled examples across the 120 categories, with class imbalance

A candidate on-device model budget of <250MB and <2.5W peak power draw

Access to a hosted LLM API for server inference and a smaller distilled local model for on-device inference

Mobile telemetry, feature flags, and an existing evaluation pipeline for offline classification benchmarks

Optional metadata at inference time: locale, app surface, and whether OCR was used

Task

Propose how you would decide between on-device, server-side, or a hybrid routing approach for categorization, explicitly defining the evaluation plan before the architecture.

Design the prompting / output contract so the model returns a valid category, calibrated confidence, and a rationale without leaking unsupported claims.

Describe the serving architecture, including fallback behavior for offline mode, poor connectivity, and low-confidence predictions.

Specify offline and online evaluation, including quality, hallucination, latency, cost, battery, and safety metrics.

Identify key failure modes such as prompt injection in user text, privacy leakage, and long-tail category errors, and explain mitigations.

Constraints

p95 end-to-end latency: <300ms on modern devices, <800ms on low-end devices or poor networks

Monthly inference budget: <$25K at 8M categorizations/month

Quality bar: macro-F1 0.88 on English,  0.80 on top 5 non-English locales

Hallucination ceiling: rationale must not invent unsupported attributes in >1% of sampled outputs

Privacy: raw user text from minors and enterprise accounts cannot leave device without explicit consent

Reliability: feature should degrade gracefully offline and under high server load

Available Resources

1.8M historical labeled examples across the 120 categories, with class imbalance

A candidate on-device model budget of <250MB and <2.5W peak power draw

Access to a hosted LLM API for server inference and a smaller distilled local model for on-device inference

Mobile telemetry, feature flags, and an existing evaluation pipeline for offline classification benchmarks

Optional metadata at inference time: locale, app surface, and whether OCR was used

Task

Propose how you would decide between on-device, server-side, or a hybrid routing approach for categorization, explicitly defining the evaluation plan before the architecture.

Design the prompting / output contract so the model returns a valid category, calibrated confidence, and a rationale without leaking unsupported claims.

Describe the serving architecture, including fallback behavior for offline mode, poor connectivity, and low-confidence predictions.

Specify offline and online evaluation, including quality, hallucination, latency, cost, battery, and safety metrics.

Identify key failure modes such as prompt injection in user text, privacy leakage, and long-tail category errors, and explain mitigations.

Interview Guides

Context

Constraints

Available Resources

Task

Choose Mobile Inference for Categorization

Context

Constraints

Available Resources

Task

Your Answer

Choose Mobile Inference for Categorization

Context

Constraints

Available Resources

Task

Choose Mobile Inference for Categorization

Context

Constraints

Available Resources

Task

Your Answer