Context
ShopSnap is adding a smart categorization feature to its mobile app. When a user saves a note, receipt, photo caption, or short text snippet, the app should assign one of 120 product-defined categories and return a confidence score plus a short rationale for debugging.
Constraints
- p95 end-to-end latency: <300ms on modern devices, <800ms on low-end devices or poor networks
- Monthly inference budget: <$25K at 8M categorizations/month
- Quality bar: macro-F1 0.88 on English, 0.80 on top 5 non-English locales
- Hallucination ceiling: rationale must not invent unsupported attributes in >1% of sampled outputs
- Privacy: raw user text from minors and enterprise accounts cannot leave device without explicit consent
- Reliability: feature should degrade gracefully offline and under high server load
Available Resources
- 1.8M historical labeled examples across the 120 categories, with class imbalance
- A candidate on-device model budget of <250MB and <2.5W peak power draw
- Access to a hosted LLM API for server inference and a smaller distilled local model for on-device inference
- Mobile telemetry, feature flags, and an existing evaluation pipeline for offline classification benchmarks
- Optional metadata at inference time: locale, app surface, and whether OCR was used
Task
- Propose how you would decide between on-device, server-side, or a hybrid routing approach for categorization, explicitly defining the evaluation plan before the architecture.
- Design the prompting / output contract so the model returns a valid category, calibrated confidence, and a rationale without leaking unsupported claims.
- Describe the serving architecture, including fallback behavior for offline mode, poor connectivity, and low-confidence predictions.
- Specify offline and online evaluation, including quality, hallucination, latency, cost, battery, and safety metrics.
- Identify key failure modes such as prompt injection in user text, privacy leakage, and long-tail category errors, and explain mitigations.