Business Context
ApexBank wants a central orchestration layer for enterprise LLM applications used by support, legal, and internal search teams. The platform must route each incoming request to the right model and toolchain while enforcing cost, latency, and compliance constraints.
Data Characteristics
- Volume: ~2M historical prompts and responses, plus 80K new requests per day
- Input types: user prompts, conversation history, metadata, retrieval context, tool outputs
- Text length: 10-2,000 tokens per request; median 220 tokens
- Language: primarily English; roughly 12% of traffic is in other languages
- Labels available: routing target, task type, escalation outcome, user feedback, policy violations
- Class distribution: highly imbalanced; FAQ/search tasks dominate, while legal review and high-risk compliance drafting are rare
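To make the data characteristics above concrete, the following sketch models one request record with the listed inputs and labels as a dataclass. All field names are illustrative assumptions, not a mandated schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RoutingRequest:
    """One incoming request; labels are present only on historical data."""
    prompt: str
    history: list = field(default_factory=list)            # prior conversation turns
    metadata: dict = field(default_factory=dict)           # e.g. team, channel, user tier
    retrieval_context: list = field(default_factory=list)  # retrieved documents
    tool_outputs: list = field(default_factory=list)
    # Labels (historical data only)
    routing_target: Optional[str] = None   # e.g. "small", "large", "legal-domain"
    task_type: Optional[str] = None        # e.g. "faq", "legal_review"
    escalation_outcome: Optional[str] = None
    user_feedback: Optional[int] = None
    policy_violation: bool = False
```

Keeping labels optional on the same record type lets the identical schema serve both training (labeled history) and live inference (unlabeled traffic).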
Success Criteria
A good solution should:
- Achieve at least 90% routing accuracy on known task classes
- Reduce average inference cost by 25% versus always calling the largest model
- Keep p95 end-to-end latency under 2 seconds for standard requests
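The three success criteria can be scored directly from routing logs. This is a minimal sketch assuming each logged record carries illustrative fields `pred`, `true`, `cost`, and `latency_ms`, and that `largest_model_cost` is the average per-request cost of always calling the largest model:

```python
import math

def evaluate_routing(records, largest_model_cost):
    """Return (routing accuracy, cost reduction vs. largest-model baseline,
    p95 latency in ms) for a sample of routed traffic."""
    records = list(records)
    n = len(records)
    accuracy = sum(r["pred"] == r["true"] for r in records) / n
    avg_cost = sum(r["cost"] for r in records) / n
    cost_reduction = 1.0 - avg_cost / largest_model_cost
    # Nearest-rank p95: smallest latency >= 95% of the sample
    latencies = sorted(r["latency_ms"] for r in records)
    p95_latency = latencies[min(n - 1, math.ceil(0.95 * n) - 1)]
    return accuracy, cost_reduction, p95_latency
```

Run against a held-out slice of the ~2M historical requests, these numbers map one-to-one onto the 90% / 25% / 2s targets.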
Constraints
- No sensitive data may leave the approved VPC
- All prompts and outputs must be logged for auditability
- High-risk requests must be sent to approved models only
- The system must degrade gracefully if a model or retrieval service is unavailable
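The last two constraints interact: a policy gate must hold even when backends are down. A minimal sketch of that selection logic, with hypothetical model names and fallback actions (the approved-model set and `queue_for_human_review` outcome are assumptions for illustration):

```python
# Illustrative allow-list; in practice this comes from a governed config store.
APPROVED_HIGH_RISK_MODELS = {"legal-llm-vpc", "compliance-llm-vpc"}

def select_backend(risk: str, preferred: str, available: set) -> str:
    """Pick a backend that satisfies the policy constraints above.

    High-risk traffic may only reach approved models; if none are up,
    degrade gracefully to human review rather than routing off-policy.
    """
    if risk == "high":
        candidates = sorted(m for m in APPROVED_HIGH_RISK_MODELS if m in available)
        return candidates[0] if candidates else "queue_for_human_review"
    if preferred in available:
        return preferred
    # Preferred backend is down: fall back to a cheaper model if possible.
    return "small-fallback" if "small-fallback" in available else "queue_for_human_review"
```

Note the ordering: the risk check runs before any availability-based fallback, so an outage can never cause a high-risk request to leak to an unapproved model.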
Requirements
- Design an NLP pipeline that classifies each request by intent, risk, and orchestration path.
- Describe preprocessing for prompts, chat history, metadata, and retrieved documents.
- Build a routing model that selects among small, large, and domain-specific LLM backends.
- Include fallback logic, policy checks, and retrieval augmentation in the orchestration flow.
- Provide a Python implementation for preprocessing, training, inference, and evaluation.
- Explain how you would monitor drift, routing quality, latency, and policy failures in production.
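As a starting point for the preprocessing and routing-model requirements, here is a deliberately small baseline sketch: flatten prompt plus recent history into tokens, then route with a multinomial Naive Bayes classifier. This is a stdlib-only illustration of the pipeline shape, not the production model (which would likely use learned embeddings and the full label set):

```python
import math
import re
from collections import Counter, defaultdict

def preprocess(prompt, history=(), max_history_turns=3):
    """Flatten prompt + recent chat history into lowercase tokens.
    Truncating history bounds input length for long conversations."""
    text = " ".join(list(history)[-max_history_turns:] + [prompt])
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayesRouter:
    """Multinomial Naive Bayes over bag-of-words features: a cheap
    intent/route baseline that handles class imbalance via priors."""

    def fit(self, token_lists, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(token_lists, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_lp = None, float("-inf")
        for label, count in self.class_counts.items():
            lp = math.log(count / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:  # Laplace-smoothed token likelihoods
                lp += math.log((self.word_counts[label][t] + 1) / denom)
            if lp > best_lp:
                best_label, best_lp = label, lp
        return best_label
```

The same `preprocess` function would feed both training on the historical corpus and live inference, keeping the two paths consistent, which is also where drift monitoring hooks in: log the token distributions it emits and compare them over time.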