Product Context
HelpHub is a SaaS customer support platform used by mid-market e-commerce merchants. Today, most incoming support tickets are reviewed manually by agents; the company wants to move first to ML-assisted triage, and then to fully automated resolution for low-risk cases.
Scale
| Signal | Value |
|---|---|
| DAU (end customers creating tickets) | 18M |
| Support agents | 42K |
| Peak ticket creation QPS | 9K |
| Tickets per day | 220M |
| Historical resolved tickets | 14B |
| Active help-center articles / macros | 3.5M |
| End-to-end decision latency budget | 350ms p99 |
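A quick back-of-envelope check on these numbers (a sketch; the constants come straight from the table above): 220M tickets per day averages out to roughly 2.5K QPS, so the stated 9K peak is about a 3.5x burst over the mean.

```python
# Sanity-check the Scale table: average ticket-creation rate vs. stated peak.
TICKETS_PER_DAY = 220_000_000
SECONDS_PER_DAY = 86_400
PEAK_QPS = 9_000

avg_qps = TICKETS_PER_DAY / SECONDS_PER_DAY  # ~2,546 QPS average
burst_factor = PEAK_QPS / avg_qps            # ~3.5x peak-to-mean ratio

print(f"avg {avg_qps:.0f} QPS, peak/avg {burst_factor:.1f}x")
```

Any serving design therefore needs headroom for short bursts well above the average rate, within the 350ms p99 budget.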
Task
Design an end-to-end ML system that decides whether a ticket should be:
- routed to a human,
- shown an agent-assist recommendation, or
- fully auto-resolved with a suggested action or response.
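The three-way decision above can be sketched as a thresholded policy. This is a minimal illustration, not a prescribed design: the function name, threshold values, and the `intent_is_low_risk` flag are all assumptions; real thresholds would be tuned per intent and merchant.

```python
from enum import Enum

class Decision(Enum):
    ROUTE_TO_HUMAN = "route_to_human"
    AGENT_ASSIST = "agent_assist"
    AUTO_RESOLVE = "auto_resolve"

# Illustrative thresholds only; production values would be tuned offline
# and validated in staged rollouts.
ASSIST_THRESHOLD = 0.5
AUTO_THRESHOLD = 0.95

def decide(confidence: float, intent_is_low_risk: bool) -> Decision:
    """Auto-resolution requires both high model confidence and an intent
    on the low-risk allowlist; mid-confidence tickets get agent assist."""
    if intent_is_low_risk and confidence >= AUTO_THRESHOLD:
        return Decision.AUTO_RESOLVE
    if confidence >= ASSIST_THRESHOLD:
        return Decision.AGENT_ASSIST
    return Decision.ROUTE_TO_HUMAN
```

Note that the allowlist check gates automation independently of model confidence, matching the constraint that only low-risk intents are automatable at launch.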
Your design should address:
- Requirements and scope: define what decisions are automated vs kept human-in-the-loop, and what success means.
- System architecture: propose the online and offline architecture, including retrieval, ranking, and final decisioning.
- Modeling choices: choose models for candidate retrieval, ranking, and automation eligibility, with clear tradeoffs.
- Data and training: define labels, feedback loops, feature pipelines, retraining cadence, and how you avoid training-serving skew.
- Evaluation and launch: explain offline metrics, online experimentation, guardrails, and a staged rollout from assistive to autonomous mode.
- Failure modes: identify key risks such as bad auto-resolutions, feature drift, policy violations, and outages.
Constraints
- Only low-risk intents (refund status, order tracking, password reset, FAQ-style issues) are eligible for full automation at launch.
- Some tickets contain PII and payment data; raw text retention is limited by compliance policy.
- Merchants require auditability: every automated action must log the evidence and model version used.
- Cost target is under $0.004 per resolved ticket on average, so expensive LLM calls cannot be used on every request.
- New policies and macros are updated hourly, so the system must handle freshness without full retraining for every change.
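The auditability constraint (log the evidence and model version for every automated action) could be satisfied with a record like the following. This is a hypothetical schema sketch; the field names are assumptions, and the policy-snapshot field reflects the hourly macro/policy refresh cadence noted above.

```python
from dataclasses import dataclass, field
import time

@dataclass(frozen=True)
class AutomationAuditRecord:
    """One immutable record per automated action, written before execution."""
    ticket_id: str
    decision: str                   # e.g. "auto_resolve"
    intent: str                     # e.g. "refund_status"
    model_version: str              # exact model artifact identifier
    retrieved_evidence_ids: tuple   # help-center articles / macros consulted
    confidence: float
    policy_snapshot_version: str    # hourly policy/macro snapshot in effect
    timestamp: float = field(default_factory=time.time)

record = AutomationAuditRecord(
    ticket_id="t-123",
    decision="auto_resolve",
    intent="order_tracking",
    model_version="ranker-2024-05-01",
    retrieved_evidence_ids=("macro-88", "article-4021"),
    confidence=0.97,
    policy_snapshot_version="2024-05-01T14:00Z",
)
```

Storing evidence by ID rather than raw text also keeps the audit trail compatible with the limited-retention policy for PII and payment data.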