Business Context
LexiOps, an internal AI platform for a large SaaS company, serves employee requests such as policy Q&A, code help, incident summaries, and contract drafting. The platform currently sends every request to the same large model, incurring unnecessary cost and latency; the team wants an orchestration layer that routes each request to the right model and tool chain.
Data
- Volume: 3.5M historical prompts and responses, plus ~120K new requests per day
- Text length: 5 to 2,000 tokens per request; median 180 tokens
- Language: English (95%), mixed multilingual content in the remainder
- Labels: Historical traces include task type, selected model/tool path, latency, cost, user rating, and fallback events
- Class distribution: Heavily skewed toward general Q&A and summarization; low-frequency classes include legal drafting and incident analysis
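The skew toward general Q&A and summarization matters at training time: without correction, the router will underfit the rare legal-drafting and incident-analysis paths. A minimal sketch of deriving balanced class weights with scikit-learn, using hypothetical label names and counts that mirror the skew described above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label distribution mirroring the skew described above.
labels = (
    ["general_qa"] * 6000
    + ["summarization"] * 3000
    + ["code_help"] * 700
    + ["incident_analysis"] * 200
    + ["legal_drafting"] * 100
)

classes = np.array(sorted(set(labels)))
# "balanced" assigns each class weight n_samples / (n_classes * count),
# so rare classes receive proportionally larger weights.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_map = dict(zip(classes, weights))
print(weight_map)
```

The resulting weights can be passed to the training loss (e.g. a weighted cross-entropy) to counteract the imbalance.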
Success Criteria
A good solution should:
- Route requests to the correct orchestration path with >= 88% macro-F1 on offline evaluation
- Reduce average inference cost by >= 30%
- Keep p95 routing latency under 50 ms
- Preserve answer quality within 2 percentage points of the current single-model baseline
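The macro-F1 target can be checked offline with scikit-learn. Macro averaging scores each class equally, so rare paths such as legal drafting count as much as the dominant Q&A class; a small sketch with hypothetical labels:

```python
from sklearn.metrics import f1_score

# Hypothetical routing labels: ground truth vs. classifier predictions.
y_true = ["qa", "qa", "summarize", "code", "legal", "qa", "code", "summarize"]
y_pred = ["qa", "qa", "summarize", "code", "qa", "qa", "code", "summarize"]

# Macro-F1 averages per-class F1 with equal weight, so the single missed
# "legal" example drags the score down sharply.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro-F1: {macro_f1:.3f}  (target: >= 0.88)")
```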
Constraints
- Routing must run before generation, so the classifier must be lightweight
- Some requests require retrieval, some require function calling, and some must be blocked or escalated
- Sensitive prompts cannot be sent to external APIs
- The system must support fallback when the first selected model fails or exceeds latency budget
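The constraints above map naturally onto a rule layer that runs between classification and generation: it enforces the data-residency rule and attaches an ordered fallback chain that the orchestrator can walk when a model fails or exceeds its latency budget. A minimal sketch, assuming a hypothetical keyword-based sensitivity check and made-up model names (`internal-llm`, `external-llm`):

```python
from dataclasses import dataclass

# Hypothetical sensitivity markers; a real system would use a proper
# PII/sensitivity detector, not a keyword list.
SENSITIVE_MARKERS = ("ssn", "salary", "password")

@dataclass
class Route:
    path: str              # orchestration path predicted by the classifier
    model_chain: list[str] # ordered fallback chain, first entry tried first

def apply_policy(prompt: str, predicted_path: str) -> Route:
    """Rule layer that runs after classification but before generation."""
    # Constraint: sensitive prompts never leave the internal deployment.
    if any(marker in prompt.lower() for marker in SENSITIVE_MARKERS):
        return Route(predicted_path, ["internal-llm"])
    # Otherwise allow an external model with an internal fallback, so the
    # orchestrator can retry when the first model fails or runs over budget.
    return Route(predicted_path, ["external-llm", "internal-llm"])

route = apply_policy("What is our password rotation policy?", "policy_qa")
print(route.model_chain)  # sensitive -> internal only
```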
Requirements
- Build an NLP routing system that classifies each incoming request into an orchestration path.
- Design preprocessing for noisy prompts, code snippets, markdown, and multilingual text.
- Implement a modern Python solution using transformers for routing classification.
- Include model training, evaluation, and a simple rule-based policy layer for constraints.
- Explain how you would connect the classifier to downstream LLM, RAG, and tool-calling workflows.
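For the preprocessing requirement, one possible normalization pass over noisy prompts: collapse fenced and inline code into a placeholder token so the router can use "code is present" as a signal without memorizing snippet contents, then strip markdown markup and extra whitespace. The `[CODE]` placeholder and the regexes are illustrative assumptions, not a prescribed design:

```python
import re

# Fenced and inline code are replaced by a placeholder token.
FENCE = "`" * 3
CODE_FENCE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
INLINE_CODE = re.compile(r"`[^`]+`")
MD_MARKUP = re.compile(r"[#*_>]+")
WHITESPACE = re.compile(r"\s+")

def preprocess(prompt: str) -> str:
    """Normalize a noisy prompt before it reaches the routing classifier."""
    text = CODE_FENCE.sub(" [CODE] ", prompt)
    text = INLINE_CODE.sub(" [CODE] ", text)
    text = MD_MARKUP.sub(" ", text)  # strip markdown markup characters
    return WHITESPACE.sub(" ", text).strip()

sample = f"## Help\nWhy does {FENCE}py\nx = 1{FENCE} fail?"
print(preprocess(sample))  # -> "Help Why does [CODE] fail?"
```

The normalized text would then feed a lightweight transformer classifier (per the latency constraint), whose predicted path the policy layer maps to the downstream LLM, RAG, or tool-calling workflow.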