## Product Context
Sparksoft Support Cloud routes inbound customer conversations to the best help content, automation, or human queue. The platform serves enterprise support teams that need fast responses while keeping inference and infrastructure costs predictable.
## Scale
| Signal | Value |
|---|---|
| Enterprise agents supported | 85K |
| End customers served monthly | 120M |
| Peak inbound conversation QPS | 18K requests/sec |
| Daily support events | 900M (messages, clicks, status changes) |
| Knowledge base size | 14M articles/macros/past resolutions |
| Active routing targets | 35K queues, bots, workflows |
| p99 latency budget | 180ms end-to-end |
| Availability target | 99.95% |
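A quick back-of-envelope check of these figures can anchor the sizing discussion. The sketch below derives a few rates purely from the numbers in the table; the derived values are rough estimates, not additional requirements:

```python
# Back-of-envelope sizing from the scale table above.
# Inputs are taken directly from the table; derived figures are estimates.

PEAK_QPS = 18_000            # peak inbound conversation requests/sec
DAILY_EVENTS = 900_000_000   # messages, clicks, status changes per day
P99_BUDGET_MS = 180          # end-to-end p99 latency budget
AVAILABILITY = 0.9995        # availability target

avg_events_per_sec = DAILY_EVENTS / 86_400
peak_to_avg = PEAK_QPS / avg_events_per_sec
downtime_min_per_month = (1 - AVAILABILITY) * 30 * 24 * 60

print(f"avg event rate:      {avg_events_per_sec:,.0f} events/sec")
print(f"peak vs. avg stream: {peak_to_avg:.2f}x")
print(f"allowed downtime:    {downtime_min_per_month:.1f} min/month")
```

The roughly 21-minute monthly downtime allowance implied by 99.95% is what makes the deterministic-fallback constraint below load-bearing rather than optional.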
## Task
Design an end-to-end ML system for Sparksoft Support Cloud that, for each incoming customer message, selects the best next action: retrieve relevant help content, rank likely resolution paths, and optionally re-rank for business rules such as SLA priority, language, and compliance. Your design should explicitly balance cost, performance, and reliability rather than optimizing only model quality.
Address the following:
- Clarify the product objective, prediction target, and success metrics for automated support triage.
- Propose a multi-stage architecture (retrieval → ranking → re-ranking) and explain which parts run online vs. batch.
- Size the system and give a latency and cost budget across stages, including feature serving and model inference.
- Choose models for each stage and justify why they fit the scale and cost constraints.
- Define offline and online evaluation, including how you would measure cost-efficiency and guard against regressions.
- Identify key failure modes such as feature drift, training-serving skew, stale indexes, and degraded fallbacks.
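One way to make the staged latency budget concrete is a small pipeline sketch. The stage names match the architecture above, but the per-stage budgets and candidate counts are illustrative assumptions a candidate would justify, not part of the spec:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str
    budget_ms: float          # share of the 180 ms end-to-end p99 budget
    out_candidates: int       # candidates handed to the next stage
    run: Callable[[List[str]], List[str]]

def make_pipeline() -> List[Stage]:
    # Illustrative budget split; real numbers would come from profiling.
    # The remaining ~50 ms covers feature fetch, network, and serialization.
    return [
        Stage("retrieval",  40, 500, lambda c: c[:500]),  # ANN over 14M docs, CPU
        Stage("ranking",    60,  50, lambda c: c[:50]),   # lightweight scorer, CPU
        Stage("re-ranking", 30,  10, lambda c: c[:10]),   # SLA / language / compliance rules
    ]

def total_model_budget_ms(pipeline: List[Stage]) -> float:
    return sum(s.budget_ms for s in pipeline)
```

Writing the budget down this way forces the funnel shape (wide, cheap retrieval; narrow, costlier re-ranking) to be explicit before any model choices are made.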
## Constraints
- 40% of requests come from the top 200 enterprise tenants, creating strong traffic skew.
- New knowledge-base articles must become retrievable within 10 minutes.
- Some tenants prohibit cross-tenant training data leakage and require regional data residency.
- GPU capacity is limited; most serving must run on CPU, with selective use of heavier models.
- The system must degrade gracefully to deterministic routing rules if ML components are slow or unavailable.
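The last constraint can be sketched as a deadline-bounded call with a rule-based fallback. This is a minimal illustration, assuming a hypothetical `ml_ranker` callable and a 150 ms deadline carved out of the 180 ms budget; neither name nor number comes from the spec:

```python
import concurrent.futures

def deterministic_route(message: dict) -> str:
    # Rule-based fallback: e.g. a tenant-configured default queue.
    return message.get("default_queue", "general-support")

def route_with_fallback(message: dict, ml_ranker, deadline_s: float = 0.15) -> str:
    """Return the ML ranker's choice, or the deterministic route if it is slow or fails."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(ml_ranker, message)
        try:
            return future.result(timeout=deadline_s)
        except Exception:  # timeout or ranker error
            return deterministic_route(message)
```

Note that the worker thread still runs to completion after a timeout; a production version would put the deadline on the RPC itself (or use a cancellable call) rather than abandoning a thread per request.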