Product Context
HelpFlow is a SaaS customer support platform used by large e-commerce and fintech companies. Design an AI-driven support system that can understand incoming user issues, retrieve relevant knowledge, rank next-best actions, and decide whether to answer automatically, suggest an agent reply, or route to a human specialist.
Scale
| Signal | Value |
|---|---|
| End customers served | 60M monthly active end users |
| Enterprise agents | 85K active agents |
| Peak inbound support requests | 18K QPS across chat + email ingestion |
| Concurrent live chat sessions | 1.2M |
| Knowledge base size | 45M help articles, macros, and prior resolved tickets |
| New/updated documents per day | 3.5M |
| p99 latency budget for live chat assist | 700ms end-to-end |
| Auto-resolution target | 35% of eligible tickets |
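A few derived quantities help size the indexing and serving paths before any architecture is chosen. The sketch below uses only the figures in the table; the variable names are illustrative, and two simplifications are assumed: updates are spread evenly across the day, and each new or updated document is counted as a distinct document (which overstates churn if the same document is edited repeatedly).

```python
# Back-of-envelope figures derived only from the Scale table.
SECONDS_PER_DAY = 24 * 60 * 60

active_agents = 85_000
concurrent_chats = 1_200_000
kb_documents = 45_000_000
doc_updates_per_day = 3_500_000

# Average index write rate, ASSUMING updates are spread evenly over the day.
# Real traffic is bursty, so peak write rate will be higher.
avg_doc_updates_per_sec = doc_updates_per_day / SECONDS_PER_DAY

# Share of the knowledge base touched per day, ASSUMING each update hits a
# distinct document.
daily_churn_fraction = doc_updates_per_day / kb_documents

# Concurrent chats per agent if every active agent were online at once.
# Fewer agents online means an even higher ratio.
chats_per_agent = concurrent_chats / active_agents

print(f"{avg_doc_updates_per_sec:.1f} document updates/s on average")
print(f"{daily_churn_fraction:.1%} of the knowledge base changes per day")
print(f"{chats_per_agent:.1f} concurrent chats per active agent")
```

Under these assumptions the script prints roughly 40.5 updates/s, 7.8% daily churn, and 14.1 concurrent chats per agent. The last ratio suggests that agents alone could not attend to every live session at once, which is consistent with the table's 35% auto-resolution target.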
Task
- Clarify the product requirements and define what decisions the ML system makes versus what remains rule-based or human-controlled.
- Propose a scalable multi-stage architecture for intake, retrieval, ranking, re-ranking, and final action selection.
- Choose models for each stage and explain tradeoffs across quality, latency, and cost.
- Design the offline and online data pipelines, including labels, feedback loops, and feature storage.
- Define evaluation, experimentation, monitoring, and rollback strategy.
- Identify major failure modes, especially feature drift, training-serving skew, and unsafe or low-confidence automation.
Constraints
- Live chat responses must meet p99 < 700ms; agent-assist suggestions should ideally appear in < 300ms.
- Some tenants require data isolation and cannot share raw ticket text across customers.
- Personally identifiable information must be redacted before long-term storage or model training.
- Knowledge content changes frequently; freshness matters for refunds, policy changes, and outages.
- Wrong auto-responses are costly, so the system must support confidence thresholds and safe fallback to humans.
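The last constraint can be made concrete with a small decision function. This is a minimal sketch, not a prescribed design: `TenantPolicy`, `select_action`, the `eligible` flag, and both threshold values are hypothetical, and the confidence score is assumed to be calibrated to the range [0, 1]. The point it illustrates is that the least risky action is the default, and automation happens only when every check passes.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    AUTO_ANSWER = "auto_answer"
    SUGGEST_TO_AGENT = "suggest_to_agent"
    ROUTE_TO_SPECIALIST = "route_to_specialist"


@dataclass(frozen=True)
class TenantPolicy:
    # Hypothetical per-tenant settings; the defaults are placeholders,
    # not recommended values.
    auto_answer_threshold: float = 0.90
    suggest_threshold: float = 0.60
    automation_enabled: bool = True


def select_action(confidence: float, eligible: bool, policy: TenantPolicy) -> Action:
    """Map a confidence score (assumed calibrated, in [0, 1]) to an action."""
    # Out-of-range or NaN scores fail this check and fall back to a human.
    if not 0.0 <= confidence <= 1.0:
        return Action.ROUTE_TO_SPECIALIST
    # Auto-answer only when the tenant allows it, the ticket type is
    # eligible, and the score clears the higher threshold.
    if policy.automation_enabled and eligible and confidence >= policy.auto_answer_threshold:
        return Action.AUTO_ANSWER
    # Medium confidence: show the draft to an agent instead of sending it.
    if confidence >= policy.suggest_threshold:
        return Action.SUGGEST_TO_AGENT
    # Low confidence: safe fallback to a human specialist.
    return Action.ROUTE_TO_SPECIALIST
```

For example, `select_action(0.95, eligible=True, policy=TenantPolicy())` returns `Action.AUTO_ANSWER`, while the same score on an ineligible ticket is downgraded to `Action.SUGGEST_TO_AGENT`. Keeping the thresholds in a per-tenant policy object also gives a simple rollback lever: setting `automation_enabled=False` disables auto-answers without redeploying a model.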