## Product Context
AssistFlow is a client-facing AI support platform used by enterprise customers. End users type a question in a web or mobile chat, and the system must retrieve relevant knowledge, rank candidate answers, and return a high-quality response with low latency.
## Scale
| Signal | Value |
|---|---|
| DAU | 12M end users |
| Peak QPS | 45K chat turns/sec |
| Enterprise tenants | 18K |
| Knowledge corpus | 220M documents/snippets across tenants |
| New/updated documents | 9M per day |
| p99 latency budget | 900ms end-to-end |
| Context window for final response | up to 16K tokens |
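As a quick sanity check on these numbers, the peak QPS and per-request cost cap imply a hard ceiling on inference spend, and the 900ms p99 budget has to be split across stages. The stage split below is purely illustrative (not given in the spec); only `PEAK_QPS`, `COST_PER_REQ`, and `P99_BUDGET_MS` come from the table above.

```python
# Back-of-envelope check on the scale/cost figures from the table.
PEAK_QPS = 45_000          # chat turns/sec at peak
COST_PER_REQ = 0.012       # max average inference cost per request ($)
P99_BUDGET_MS = 900        # end-to-end latency budget

# Upper bound on inference spend if every peak-second request hits the cap.
peak_spend_per_sec = PEAK_QPS * COST_PER_REQ  # dollars per second at peak

# One *assumed* way to split the p99 budget across stages; any real design
# would tune these from measured stage latencies.
stage_budget_ms = {
    "retrieval": 150,
    "ranking": 150,
    "synthesis": 500,
    "network_and_overhead": 100,
}
assert sum(stage_budget_ms.values()) == P99_BUDGET_MS
```

At $540/s of peak spend, the cost constraint pushes the design toward cheap early stages (vector recall, lightweight rankers) with the GPU-heavy generative stage reserved for the final few candidates.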
## Task
Design an end-to-end ML system for this application, from ingestion through production monitoring. Your design should address:
- How data is ingested, cleaned, versioned, and made available for training and serving
- A multi-stage online architecture for retrieval, ranking, and optional final re-ranking or answer synthesis
- Model choices for each stage, including how you handle tenant isolation, freshness, and cold start
- Training, offline evaluation, and online experimentation strategy
- Monitoring, alerting, rollback, and failure handling in production
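The multi-stage architecture asked for above can be sketched minimally as a cheap recall stage followed by a heavier rerank stage, with tenant isolation enforced as a hard filter at retrieval time. All names here (`Doc`, `retrieve`, `rank`) are hypothetical, and cosine similarity stands in for both the ANN index lookup and the cross-encoder scorer a real system would use.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    tenant_id: str
    text: str
    embedding: list  # precomputed dense vector

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, tenant_id, k=50):
    # Stage 1: cheap vector recall. The tenant filter is a hard predicate,
    # not a ranking feature -- cross-tenant leakage must be impossible.
    candidates = [d for d in index if d.tenant_id == tenant_id]
    candidates.sort(key=lambda d: cosine(query_vec, d.embedding), reverse=True)
    return candidates[:k]

def rank(query_vec, candidates, k=5):
    # Stage 2: stand-in for a heavier cross-encoder that rescores the
    # small candidate set before optional answer synthesis.
    scored = [(cosine(query_vec, d.embedding), d) for d in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]
```

A production system would replace the linear scan with a per-tenant (or tenant-partitioned) ANN index and the rerank scorer with a learned model, but the shape — recall wide, rank narrow, synthesize last — is the same.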
## Constraints
- Tenant data must remain logically isolated; cross-tenant leakage is a hard compliance failure
- 35% of documents are updated within 24 hours, so freshness matters
- GPU budget is limited: average inference cost must stay below $0.012 per request
- Some customers require explainability: the system must expose citations and confidence signals
- The product must degrade gracefully if the generative stage is unavailable; retrieval-only answers are acceptable
- Training labels are partially delayed and noisy because many users do not explicitly rate responses
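The graceful-degradation constraint can be made concrete with a small fallback sketch: if the generative stage fails or times out, the response is assembled from the top retrieved snippets, and citations are attached in both paths so the explainability requirement survives the outage. `synthesize` and the response shape are hypothetical stand-ins, not a prescribed API.

```python
def synthesize(query, docs):
    # Stand-in for the generative stage; simulated here as unavailable
    # to exercise the fallback path.
    raise TimeoutError("generative backend unavailable")

def answer(query, retrieved_docs):
    # Citations are built from retrieval results, so they are available
    # even when synthesis fails -- explainability does not depend on the GPU tier.
    citations = [d["doc_id"] for d in retrieved_docs]
    try:
        text = synthesize(query, retrieved_docs)
        return {"mode": "generative", "text": text, "citations": citations}
    except (TimeoutError, ConnectionError):
        # Degrade gracefully: return the top retrieved snippets verbatim.
        snippets = [d["text"] for d in retrieved_docs[:3]]
        return {"mode": "retrieval_only",
                "text": "\n".join(snippets),
                "citations": citations}
```

The same pattern extends to confidence signals: a retrieval-only response can still carry the ranker's scores, giving the client a usable answer plus an honest indication that synthesis was skipped.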