Business Context
ApexAssist is deploying an internal LLM-powered support copilot for customer service agents. The first prototype works in offline demos, but production rollout is blocked by high latency, rising inference cost, unstable output quality, and operational reliability issues.
Data
- Volume: ~2.5M historical support conversations, plus 80K new prompts per day
- Text length: user prompts range from 20 to 1,500 tokens; retrieved context adds another 200 to 3,000 tokens
- Language: English only
- Label distribution: 4 bottleneck classes from incident reviews: latency (35%), cost (25%), quality (22%), reliability/safety (18%)
- Input format: multi-turn chat transcripts, system prompts, retrieval snippets, and model responses
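For concreteness, here is one plausible shape for a single incident record. Every field name below is an illustrative assumption, not a confirmed schema:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class IncidentRecord:
    """One logged deployment incident (hypothetical field names)."""
    incident_id: str
    system_prompt: str
    turns: List[Dict[str, str]]    # e.g. {"role": "user", "text": "..."}
    retrieval_snippets: List[str]
    model_response: str
    latency_ms: float              # structured metadata from serving logs
    label: str                     # latency | cost | quality | reliability_safety
```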
Success Criteria
A strong solution should identify the main LLM deployment bottlenecks from logs and prompt traces, classify each incident correctly, and propose practical mitigations. Targets: macro-F1 >= 0.84 overall, recall >= 0.90 on the reliability/safety class, and an inference pipeline that supports near-real-time triage of production incidents.
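A minimal sketch of how these targets could be checked with scikit-learn, assuming string labels matching the four classes above:

```python
from sklearn.metrics import f1_score, recall_score

LABELS = ["latency", "cost", "quality", "reliability_safety"]

def meets_targets(y_true, y_pred):
    """Check macro-F1 >= 0.84 overall and recall >= 0.90 on reliability/safety."""
    macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    recalls = recall_score(y_true, y_pred, labels=LABELS, average=None)
    safety_recall = recalls[LABELS.index("reliability_safety")]
    return macro_f1 >= 0.84 and safety_recall >= 0.90
```

The reliability/safety recall floor is checked separately because a macro average alone can hide weak recall on the rarest, highest-stakes class.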
Constraints
- Inference latency for the classifier must stay under 120 ms per incident on a single NVIDIA T4 GPU (see the benchmark sketch after this list)
- No raw customer PII may be stored in training artifacts
- The solution must be explainable enough for platform and SRE teams to act on predictions
- Training should fit within a standard Python/Transformers stack
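A rough sketch of how the 120 ms budget could be verified, assuming a DistilBERT-class checkpoint (`distilbert-base-uncased` is illustrative, not a mandated choice):

```python
import time
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"  # assumed checkpoint, small enough for a T4

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).eval()

def p95_latency_ms(text: str, n_warmup: int = 10, n_runs: int = 100) -> float:
    """Measure p95 single-incident latency against the 120 ms budget."""
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt").to(device)
    timings = []
    with torch.inference_mode():
        for _ in range(n_warmup):          # warm up kernels and allocator
            model(**enc)
        for _ in range(n_runs):
            if device == "cuda":
                torch.cuda.synchronize()   # time the GPU work, not queueing
            start = time.perf_counter()
            model(**enc)
            if device == "cuda":
                torch.cuda.synchronize()
            timings.append((time.perf_counter() - start) * 1000)
    return sorted(timings)[int(0.95 * n_runs) - 1]
```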
Requirements
- Build an NLP system that classifies each deployment incident into the primary LLM bottleneck category.
- Design preprocessing for chat logs, prompts, retrieval context, and structured metadata (a preprocessing sketch follows this list).
- Fine-tune a modern transformer model in Python and justify architecture choices (a fine-tuning sketch follows this list).
- Define how you would evaluate classification quality and operational usefulness.
- For each predicted bottleneck class, describe concrete remediation actions such as quantization, batching, prompt compression, caching, guardrails, or fallback routing (an illustrative class-to-action mapping follows this list).
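A minimal preprocessing sketch for the chat-log requirement: it flattens one incident into a single classifier input and redacts obvious PII, in line with the constraints. The record fields follow the hypothetical schema in the Data section, and the regexes are illustrative only; production scrubbing would need a vetted PII tool:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace obvious emails and phone numbers before anything is stored."""
    return PHONE_RE.sub("<PHONE>", EMAIL_RE.sub("<EMAIL>", text))

def flatten_incident(record: dict, max_snippets: int = 3) -> str:
    """Serialize one incident into a single tagged string for the classifier."""
    parts = [f"[SYSTEM] {record['system_prompt']}"]
    for turn in record["turns"]:
        parts.append(f"[{turn['role'].upper()}] {turn['text']}")
    for snippet in record["retrieval_snippets"][:max_snippets]:
        parts.append(f"[CONTEXT] {snippet}")
    parts.append(f"[RESPONSE] {record['model_response']}")
    return redact_pii("\n".join(parts))
```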
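A fine-tuning sketch for the transformer requirement, assuming the Hugging Face Trainer API and an illustrative `distilroberta-base` checkpoint; the placeholder `train_records` stands in for flattened, PII-redacted incidents:

```python
import torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilroberta-base"  # assumed: a distilled encoder to fit the T4 budget
LABELS = ["latency", "cost", "quality", "reliability_safety"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Placeholder: real records come from flatten_incident() above.
train_records = [{"text": "[SYSTEM] ...", "label": 0}]
ds = Dataset.from_list(train_records).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bottleneck-classifier",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=2e-5,
    fp16=torch.cuda.is_available(),  # mixed precision when a GPU is present
)
Trainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer).train()
```

A distilled encoder is one defensible starting point: it keeps single-incident inference well inside the latency budget, and a larger model can be distilled or quantized into its place later if quality falls short.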
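Finally, an illustrative class-to-action mapping for the remediation requirement; the specific mitigations per class are assumptions drawn from the options named in the requirement itself:

```python
REMEDIATIONS = {
    "latency": ["int8 quantization", "dynamic batching",
                "prompt compression", "response caching"],
    "cost": ["route simple queries to a smaller model", "prompt compression",
             "cache frequent answers", "batch offline workloads"],
    "quality": ["tighten system prompts", "filter retrieval context",
                "add output validators"],
    "reliability_safety": ["guardrail filters on input and output",
                           "fallback routing to a safe template or human agent",
                           "timeouts with retry policies"],
}

def triage(predicted_class: str) -> list:
    """Map a predicted bottleneck class to first-line mitigations."""
    return REMEDIATIONS.get(predicted_class, ["escalate to on-call SRE"])
```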