Product Context
OpsPilot sells AI workflow automation to mid-market and enterprise clients in logistics, insurance, and back-office operations. The product ingests tasks from a client’s existing systems, recommends next actions to human operators, and can optionally auto-execute low-risk actions through the client’s operational tools.
Scale
| Signal | Value |
|---|---|
| Enterprise clients | 1,200 |
| Daily active operators | 3.5M |
| Peak workflow events | 180K QPS |
| Tasks processed per day | 2.2B |
| Historical task/action records | 14B |
| Average candidate actions per task | 200-2,000 |
| p99 online decision latency | 250 ms |
| New client onboarding target | < 4 weeks |
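A quick back-of-envelope pass over the table helps size the design before answering. The sketch below uses only the figures listed above; the variable names are illustrative, and the peak-to-average ratio rests on the labeled assumption that one workflow event corresponds roughly to one task, which the table does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from the Scale table.
TASKS_PER_DAY = 2_200_000_000
PEAK_EVENT_QPS = 180_000
DAILY_ACTIVE_OPERATORS = 3_500_000
ENTERPRISE_CLIENTS = 1_200
CANDIDATES_PER_TASK = (200, 2_000)  # low and high end of the stated range

# Average load implied by the daily task volume.
avg_tasks_per_second = TASKS_PER_DAY / SECONDS_PER_DAY

# Mean workload per operator and per client (real distributions will be skewed).
tasks_per_operator_per_day = TASKS_PER_DAY / DAILY_ACTIVE_OPERATORS
mean_tasks_per_client_per_day = TASKS_PER_DAY / ENTERPRISE_CLIENTS

# ASSUMPTION: one workflow event is roughly one task. The table does not say
# this, so treat the ratio as an order-of-magnitude burst factor only.
peak_to_average_ratio = PEAK_EVENT_QPS / avg_tasks_per_second

# Candidate actions that would need scoring if every candidate were scored.
candidate_scores_per_day = tuple(n * TASKS_PER_DAY for n in CANDIDATES_PER_TASK)

print(f"average tasks/s:         {avg_tasks_per_second:,.0f}")
print(f"tasks/operator/day:      {tasks_per_operator_per_day:,.0f}")
print(f"mean tasks/client/day:   {mean_tasks_per_client_per_day:,.0f}")
print(f"peak-to-average ratio:   {peak_to_average_ratio:.1f}")
print(f"candidate scores/day:    {candidate_scores_per_day[0]:.1e} to {candidate_scores_per_day[1]:.1e}")
```

Under these assumptions the average load is about 25K tasks per second, so the stated peak is roughly seven times the average, and exhaustively scoring every candidate would mean hundreds of billions to trillions of scores per day.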
Task
Design the lifecycle and production architecture for deploying an AI decisioning system into a client's existing operational workflow.
Your design should address:
- How you would clarify requirements and define success for both OpsPilot and the client, including where the boundary lies between human-in-the-loop operation and full automation.
- A multi-stage ML architecture for generating candidate actions, ranking them, and re-ranking or filtering them using business rules, compliance constraints, and client-specific policies.
- The offline and online system design: data ingestion from client systems, feature computation, training, model deployment, online serving, feedback logging, and rollback paths.
- How you would handle client heterogeneity: different schemas, sparse labels for new clients, cold start, and varying workflow volumes.
- Your evaluation plan, including offline metrics, online experimentation, operational KPIs, and how you would detect feature drift, training-serving skew, and unsafe automation.
- The top failure modes you expect during rollout and how the system should degrade safely.
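To make the multi-stage requirement concrete, here is a minimal sketch of one possible shape for the funnel: cheap candidate generation, a heavier ranking pass, then policy filtering. Every name here (`Candidate`, `generate`, `rank`, `apply_policies`, the `risk_tier` label, and the toy scorer) is hypothetical and chosen for illustration; it is not a prescribed answer, and a real system would replace the lambdas with trained models and a policy service.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    action_id: str
    retrieval_score: float  # cheap score from candidate generation
    risk_tier: str          # hypothetical label: "low" or "high"


def generate(pool: Iterable[Candidate], k: int) -> list[Candidate]:
    """Stage 1: keep the top-k candidates by the cheap retrieval score."""
    return sorted(pool, key=lambda c: c.retrieval_score, reverse=True)[:k]


def rank(
    shortlist: list[Candidate], scorer: Callable[[Candidate], float]
) -> list[tuple[Candidate, float]]:
    """Stage 2: score the shortlist with a heavier model, best first."""
    scored = [(c, scorer(c)) for c in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def apply_policies(
    ranked: list[tuple[Candidate, float]],
    policies: list[Callable[[Candidate], bool]],
) -> list[tuple[Candidate, float]]:
    """Stage 3: drop any candidate that fails at least one policy predicate."""
    return [(c, s) for c, s in ranked if all(p(c) for p in policies)]


# Toy data and toy stand-ins for the models and rules.
pool = [
    Candidate("refund", 0.9, "high"),
    Candidate("reassign", 0.8, "low"),
    Candidate("escalate", 0.4, "low"),
    Candidate("close", 0.1, "low"),
]
toy_scorer = lambda c: c.retrieval_score * (0.5 if c.risk_tier == "high" else 1.0)
client_policies = [lambda c: c.action_id != "escalate"]  # e.g. a client-specific ban

shortlist = generate(pool, k=3)
ranked = rank(shortlist, toy_scorer)
final = apply_policies(ranked, client_policies)
final_ids = [c.action_id for c, _ in final]
print(final_ids)
```

Keeping the policy stage as plain predicates over candidates, separate from the learned scorer, is one way to let rules change without touching the model.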
Constraints
- Many clients require PII minimization, audit logs, and region-specific data residency.
- Some workflows are high-risk, so the system must support confidence thresholds and mandatory human approval.
- Client source systems are inconsistent; some provide streaming events, others only daily batch exports.
- Inference cost must stay below $0.002 per task on average.
- New policies or workflow rules may need to take effect within 15 minutes without full model retraining.
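Two of these constraints translate directly into small, checkable pieces. The sketch below shows one way to express the confidence-threshold and mandatory-approval rule as a routing function, followed by the cost ceiling as arithmetic. The `Route` values, the `WorkflowPolicy` fields, and the 0.9 threshold are hypothetical; the $0.002 limit comes from the constraints, and the 2.2B tasks per day and 2,000-candidate upper bound come from the Scale table.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO_EXECUTE = "auto_execute"
    RECOMMEND = "recommend_to_operator"
    REQUIRE_APPROVAL = "require_human_approval"


@dataclass(frozen=True)
class WorkflowPolicy:
    """Hypothetical per-workflow settings, held as data rather than model weights."""
    high_risk: bool
    auto_execute_threshold: float


def route(confidence: float, policy: WorkflowPolicy) -> Route:
    """Decide how a recommended action is delivered."""
    if policy.high_risk:
        # High-risk workflows always go to a human, whatever the confidence.
        return Route.REQUIRE_APPROVAL
    if confidence >= policy.auto_execute_threshold:
        return Route.AUTO_EXECUTE
    return Route.RECOMMEND


# Cost ceiling implied by the constraint and the Scale table.
MAX_AVG_COST_PER_TASK_USD = 0.002
TASKS_PER_DAY = 2_200_000_000
MAX_CANDIDATES_PER_TASK = 2_000

daily_inference_budget_usd = MAX_AVG_COST_PER_TASK_USD * TASKS_PER_DAY
# Budget per candidate if all 2,000 candidates of a task were scored equally.
per_candidate_budget_usd = MAX_AVG_COST_PER_TASK_USD / MAX_CANDIDATES_PER_TASK

low_risk = WorkflowPolicy(high_risk=False, auto_execute_threshold=0.9)
high_risk = WorkflowPolicy(high_risk=True, auto_execute_threshold=0.9)
print(route(0.95, low_risk), route(0.50, low_risk), route(0.99, high_risk))
print(f"daily budget: ${daily_inference_budget_usd:,.0f}")
print(f"per-candidate budget at 2,000 candidates: ${per_candidate_budget_usd:.7f}")
```

Because `WorkflowPolicy` is plain data, a changed threshold or risk flag could be pushed as configuration, which is one way to meet the 15-minute requirement without retraining. The arithmetic also shows why a cheap first stage matters: at the upper candidate count, the average budget works out to about one millionth of a dollar per candidate.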