Product Context
OpenAI Operations handles a large internal queue of support and policy-review cases across surfaces such as ChatGPT, API platform operations, billing disputes, account access, and trust-and-safety escalations. Today, many of these cases are reviewed manually. Design an ML system that moves this workflow from human review to ML-assisted triage and, where safe, fully automated resolution.
Scale
| Signal | Value |
|---|---|
| Daily active users generating support demand | 25M |
| New cases per day | 1.8M |
| Peak case-ingest QPS | 450 |
| Historical resolved cases | 650M |
| Distinct resolution playbooks / actions | 1,200 |
| p99 latency budget for assisted recommendation | 800ms |
| p99 latency budget for fully automated decision | 300ms |
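A quick back-of-envelope check on the numbers above (assuming intake were spread evenly across the day) shows how bursty the workload is, which matters for capacity planning:

```python
# Sanity-check the scale figures from the table.
cases_per_day = 1_800_000
peak_qps = 450

avg_qps = cases_per_day / 86_400      # 86,400 seconds per day
burst_factor = peak_qps / avg_qps     # peak-to-average ratio

print(f"avg ingest ~{avg_qps:.1f} QPS, burst factor ~{burst_factor:.1f}x")
# avg ingest ~20.8 QPS, burst factor ~21.6x
```

A roughly 20x peak-to-average ratio suggests the serving tier should be provisioned (or autoscaled) for peaks, while batch stages can be sized near the average.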
Task
- Clarify the product scope: which case types should be human-only, which ML-assisted, and which fully automated.
- Design the end-to-end architecture, including intake, candidate resolution retrieval, ranking, re-ranking / policy checks, and action execution.
- Choose models for each stage and explain online vs batch inference, feature freshness, and how human feedback is incorporated.
- Define offline and online evaluation, including precision requirements for auto-resolution and guardrails for customer harm.
- Identify failure modes such as feature drift, training-serving skew, stale policies, and unsafe automation, with detection and mitigation.
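The staged flow in the task above (intake, candidate retrieval, ranking, policy checks, then routing to automated or human handling) can be sketched as follows. All names, thresholds, and types here are illustrative assumptions, not a prescribed design; real thresholds would be derived from offline precision targets:

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    AUTO = "auto_resolve"
    ASSIST = "ml_assisted"
    HUMAN = "human_only"

@dataclass
class Case:
    case_id: str
    surface: str   # e.g. "billing", "trust_safety"
    text: str

@dataclass
class Candidate:
    playbook_id: str
    score: float       # ranker confidence in [0, 1]
    high_risk: bool    # looked up from a playbook risk registry

# Illustrative thresholds; in practice calibrated against precision targets.
AUTO_THRESHOLD = 0.97
ASSIST_THRESHOLD = 0.60

def route_case(best: Candidate) -> Route:
    """Policy check: decide how the top-ranked candidate is executed."""
    if best.high_risk:
        return Route.HUMAN                 # mandatory human approval
    if best.score >= AUTO_THRESHOLD:
        return Route.AUTO
    if best.score >= ASSIST_THRESHOLD:
        return Route.ASSIST
    return Route.HUMAN

def triage(case: Case, candidates: list[Candidate]) -> tuple[Candidate, Route]:
    """Retrieval output -> ranking -> policy-checked routing decision."""
    best = max(candidates, key=lambda c: c.score)
    return best, route_case(best)
```

Note the asymmetry: a high-risk playbook is routed to a human regardless of model confidence, which is one simple way to encode the strict-precision constraint on risky actions.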
Constraints
- Some actions are high-risk (account suspension, refunds above a threshold, legal/privacy requests) and require strict precision or mandatory human approval.
- Resolution policies change weekly; the system must adapt quickly without retraining everything from scratch.
- Auditability is required: every recommendation or automated action must be explainable and logged.
- Cost matters: the majority of cases should be handled on CPU-first infrastructure, with selective use of heavier models.
- Personally identifiable information must be minimized in features and logs.
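The auditability and PII-minimization constraints can be combined in the logging layer: every decision is logged with an explanation, but raw identifiers are pseudonymized before they reach the log. A minimal sketch, with all function names and the salt-handling scheme assumed for illustration (real systems need managed, rotated salts or tokenization services):

```python
import hashlib
import json
import time

def pseudonymize(user_id: str, salt: str = "example-salt") -> str:
    """Replace a raw identifier with a salted hash so audit logs stay
    joinable without storing PII. Salt management is assumed, not shown."""
    return hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()[:16]

def audit_record(case_id: str, user_id: str, playbook_id: str,
                 score: float, route: str, top_features: list[str]) -> str:
    """One append-only JSON log entry per recommendation or automated action."""
    return json.dumps({
        "ts": time.time(),
        "case_id": case_id,
        "user": pseudonymize(user_id),       # no raw PII in the log
        "playbook_id": playbook_id,
        "score": round(score, 4),
        "route": route,
        "explanation": top_features,         # e.g. top-k feature attributions
    })
```

Because the hash is deterministic for a fixed salt, auditors can trace all actions taken for one account without ever seeing the raw identifier.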