Product Context
AcmeCloud is a B2B SaaS platform with a large customer support operation. You are designing an AI system that reads incoming support tickets and recommends a priority, a routing destination, and likely resolutions, so that agents can respond faster and high-severity issues are handled first.
Scale
| Signal | Value |
|---|---|
| Monthly active customers | 3M |
| Support agents | 12,000 |
| New tickets per day | 18M |
| Peak ticket ingest QPS | 2,500 |
| Historical ticket corpus | 4B tickets |
| Candidate knowledge articles/macros | 25M |
| Per-ticket online latency budget (p99) | 300 ms |
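
A quick back-of-envelope check can help relate the figures in the table to each other before any design work starts. The sketch below uses only the table values; the function name `capacity_summary` and the choice of derived quantities are illustrative assumptions, not part of the problem statement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def capacity_summary(tickets_per_day: int, peak_qps: int, agents: int) -> dict:
    """Derive rough load figures from the Scale table (hypothetical helper)."""
    avg_qps = tickets_per_day / SECONDS_PER_DAY
    return {
        "avg_qps": avg_qps,                      # mean ingest rate over a day
        "peak_to_avg": peak_qps / avg_qps,       # how bursty the stated peak is
        "tickets_per_agent_per_day": tickets_per_day / agents,
    }


# Values copied from the Scale table.
summary = capacity_summary(tickets_per_day=18_000_000, peak_qps=2_500, agents=12_000)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

With these inputs the average rate is about 208 tickets per second, the stated peak is about 12 times the average, and the daily volume works out to about 1,500 tickets per agent. Serving capacity would therefore likely be sized for the peak rather than the mean.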
Task
- Clarify the product goal and define the prediction targets: priority, team routing, and recommended resolution candidates.
- Design an end-to-end ML system, including data pipelines, feature computation, model training, and online serving.
- Propose a multi-stage architecture for retrieval, ranking, and optional re-ranking of suggested resolutions or macros.
- Explain how you would handle cold-start tickets, sparse customer history, multilingual text, and rapidly changing incident patterns.
- Define offline and online evaluation, including business metrics, guardrails, and rollout strategy.
- Identify key failure modes such as feature drift, training-serving skew, stale knowledge content, and misrouting of urgent tickets.
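As one way to make the tasks above concrete, the sketch below shows a possible shape for the three prediction targets and a toy two-stage retrieve-then-rank flow for resolution candidates. Everything here is an assumption for illustration: the `TicketPrediction` fields, the label values, and the token-overlap and Jaccard scorers, which merely stand in for a real candidate generator (for example an approximate nearest-neighbor index) and a learned ranker.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TicketPrediction:
    """Hypothetical output schema covering the three prediction targets."""
    priority: str                                   # e.g. "P0".."P3" (assumed label set)
    team: str                                       # routing destination
    resolutions: List[str] = field(default_factory=list)  # ranked article/macro ids


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(ticket: str, articles: Dict[str, str], k: int) -> List[str]:
    """Stage 1: cheap, recall-oriented candidate generation (token overlap)."""
    q = tokenize(ticket)
    scored = [(len(q & tokenize(body)), aid) for aid, body in articles.items()]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))        # best overlap first, id breaks ties
    return [aid for _, aid in scored[:k]]


def rank(ticket: str, candidates: List[str], articles: Dict[str, str], n: int) -> List[str]:
    """Stage 2: costlier scorer applied only to the short list (Jaccard stand-in)."""
    q = tokenize(ticket)

    def jaccard(aid: str) -> float:
        d = tokenize(articles[aid])
        return len(q & d) / len(q | d)

    return sorted(candidates, key=lambda aid: (-jaccard(aid), aid))[:n]


articles = {
    "a1": "reset password login",
    "a2": "billing invoice refund",
    "a3": "login outage status page incident report for all regions",
}
ticket = "cannot login after password reset"
candidates = retrieve(ticket, articles, k=2)
prediction = TicketPrediction(
    priority="P2", team="identity", resolutions=rank(ticket, candidates, articles, n=1)
)
print(candidates, prediction)
```

The point of the split is cost: the first stage touches the whole corpus with a cheap score, and the second stage spends more compute on only `k` candidates. An optional re-ranking stage would slot in after `rank` in the same way.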
Constraints
- High-severity enterprise outage tickets must be prioritized with very high recall.
- Some ticket metadata arrives late or is incomplete at creation time.
- The system must support 20 languages, but only 6 have abundant labeled data.
- PII in ticket text must be redacted before it is exposed to downstream analytics.
- Serving cost matters: the default path should run on CPU, with limited GPU use for heavier models or batch jobs.
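The high-recall constraint on outage tickets can be read as a thresholding problem on a severity score. The sketch below picks the largest score threshold that still reaches a target recall on a labeled validation set. The function names, the toy scores, and the 0.99 target are assumptions chosen for illustration; the problem statement does not fix a numeric recall target.

```python
import math
from typing import Sequence


def threshold_for_recall(
    scores: Sequence[float], labels: Sequence[int], target_recall: float
) -> float:
    """Largest threshold t such that flagging `score >= t` reaches the target recall."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Number of positives that must score at or above the threshold.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def recall_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of positives flagged at the given threshold."""
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    total = sum(labels)
    return hits / total


# Toy validation data: label 1 marks a high-severity outage ticket.
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0]
t = threshold_for_recall(scores, labels, target_recall=0.99)
print(t, recall_at(scores, labels, t))
```

Lowering the threshold to protect recall flags more non-urgent tickets as urgent, so the flagged fraction is a natural guardrail to track alongside recall.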