Product Context
AtlasCRM wants to add AI-driven next-best-action recommendations for enterprise sales reps inside its web app. The feature suggests which accounts to contact, which risk signals to review, and which upsell actions to take, based on account activity, CRM history, and recent product usage.
Scale
| Signal | Value |
|---|---|
| Enterprise seats | 12M licensed users |
| DAU | 2.5M sales reps and managers |
| Peak QPS | 28K recommendation requests/sec during business hours |
| Customer accounts in graph | 180M accounts |
| Events/day | 9B CRM + product usage events |
| Per-request latency budget (p99) | 250 ms |
| Freshness target for critical signals | < 5 minutes |
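As a quick sanity check on these numbers, the back-of-envelope arithmetic below derives a few rates from the table. It is a rough sketch only: it assumes events are spread evenly over the day and that every peak request costs exactly the average cap from the Constraints section, neither of which will hold in practice.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Values copied from the Scale table.
events_per_day = 9_000_000_000
accounts_in_graph = 180_000_000
licensed_seats = 12_000_000
dau = 2_500_000
peak_qps = 28_000

# Value copied from the Constraints list (average cost cap per request).
max_avg_cost_per_request_usd = 0.002

# Average ingest rate, assuming a uniform spread over 24 hours (an assumption).
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# Average event volume per account per day.
avg_events_per_account_per_day = events_per_day / accounts_in_graph

# Fraction of licensed seats that are active on a given day.
dau_share_of_seats = dau / licensed_seats

# Spend ceiling at peak if every request cost exactly the average cap.
peak_spend_ceiling_usd_per_sec = peak_qps * max_avg_cost_per_request_usd

print(f"avg events/sec:          {avg_events_per_sec:,.0f}")
print(f"events/account/day:      {avg_events_per_account_per_day:.0f}")
print(f"DAU share of seats:      {dau_share_of_seats:.1%}")
print(f"peak spend ceiling ($/s): {peak_spend_ceiling_usd_per_sec:.2f}")
```

Under those assumptions this works out to roughly 104K events/sec on average, 50 events per account per day, about 21% of seats active daily, and a ceiling of about $56/sec at peak. Real traffic is likely to be burstier during business hours, so the streaming pipeline should be sized above the average ingest rate.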
Task
Design an end-to-end ML system for this feature, with special focus on when to use batch computation versus real-time inference and feature computation.
- Clarify the product requirements and define which outputs must be personalized, fresh, and explainable.
- Propose a multi-stage architecture for candidate generation, ranking, and optional re-ranking/policy enforcement.
- Decide which features and predictions should be precomputed in batch versus computed online in real time, and justify the split.
- Design the training, feature store, and serving architecture, including how you avoid training-serving skew.
- Define offline and online evaluation, rollout strategy, and monitoring for drift, freshness, and business-impact regressions.
- Identify key failure modes, especially around stale features, missing streaming data, and tenant-specific behavior differences.
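To make the multi-stage shape concrete, here is one possible skeleton of candidate generation, ranking, and policy enforcement. It is an illustrative sketch, not a reference answer: the `Candidate` fields, the 0.7/0.3 blend weights, the 0.5 threshold, and the reason strings are all hypothetical, and a real ranker would be a trained model rather than a fixed formula.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    tenant_id: str
    account_id: str
    action: str
    batch_score: float        # assumed: propensity precomputed by a nightly job
    recent_usage_drop: float  # assumed: streaming signal scaled to [0, 1]
    score: float = 0.0
    reasons: List[str] = field(default_factory=list)


def generate_candidates(pool: List[Candidate], tenant_id: str, limit: int = 100) -> List[Candidate]:
    """Stage 1: cheap retrieval, restricted to the caller's tenant."""
    own = [c for c in pool if c.tenant_id == tenant_id]
    return sorted(own, key=lambda c: c.batch_score, reverse=True)[:limit]


def rank(cands: List[Candidate]) -> List[Candidate]:
    """Stage 2: blend the batch score with a fresh signal (weights are illustrative)."""
    for c in cands:
        c.score = 0.7 * c.batch_score + 0.3 * c.recent_usage_drop
        # Record human-readable reasons alongside the score.
        c.reasons = [f"batch propensity {c.batch_score:.2f}"]
        if c.recent_usage_drop > 0.5:
            c.reasons.append("product usage dropped sharply in recent activity")
    return sorted(cands, key=lambda c: c.score, reverse=True)


def enforce_policy(cands: List[Candidate], tenant_id: str, top_k: int = 3) -> List[Candidate]:
    """Stage 3: re-check tenant isolation as defense in depth, then truncate."""
    return [c for c in cands if c.tenant_id == tenant_id][:top_k]


def recommend(pool: List[Candidate], tenant_id: str, top_k: int = 3) -> List[Candidate]:
    return enforce_policy(rank(generate_candidates(pool, tenant_id)), tenant_id, top_k)


if __name__ == "__main__":
    demo_pool = [
        Candidate("tenant_a", "acct_1", "call", 0.9, 0.0),
        Candidate("tenant_a", "acct_2", "review_risk", 0.6, 1.0),
        Candidate("tenant_b", "acct_9", "upsell", 0.99, 1.0),
    ]
    for rec in recommend(demo_pool, "tenant_a"):
        print(rec.account_id, rec.action, round(rec.score, 2), rec.reasons)
```

The split shown here (batch score for retrieval, streaming signal at ranking time) is only one way to divide the work; the task asks you to justify whichever split you choose.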
Constraints
- Recommendations must be tenant-isolated; no cross-customer data leakage is allowed.
- Enterprise admins require human-readable reasons for each recommendation.
- Cost matters: average inference + feature lookup cost must stay under $0.002 per request.
- Some signals arrive in streams within seconds; others are only available from nightly warehouse jobs.
- The system must degrade gracefully if the streaming pipeline is delayed or unavailable.
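The graceful-degradation constraint can be illustrated with a freshness-aware feature lookup that prefers a streaming value but falls back to the nightly batch value when the stream is stale or missing. This is a minimal sketch under assumptions: `FeatureValue`, `resolve_feature`, and the `source` labels are hypothetical names, and the 5-minute cutoff simply reuses the freshness target from the Scale table.

```python
from dataclasses import dataclass
from typing import Optional

# Reuses the "< 5 minutes" freshness target for critical signals.
FRESHNESS_TARGET_S = 5 * 60


@dataclass
class FeatureValue:
    value: float
    age_seconds: float  # time elapsed since the value was computed
    source: str         # hypothetical label: "stream", "batch", ...


def resolve_feature(
    stream: Optional[FeatureValue],
    batch: Optional[FeatureValue],
    default: float = 0.0,
) -> FeatureValue:
    """Pick the best available value: fresh stream, else batch, else a default."""
    if stream is not None and stream.age_seconds < FRESHNESS_TARGET_S:
        return stream
    if batch is not None:
        # Tag the fallback so monitoring can count degraded requests.
        return FeatureValue(batch.value, batch.age_seconds, "batch_fallback")
    return FeatureValue(default, float("inf"), "default")


if __name__ == "__main__":
    nightly = FeatureValue(0.4, 6 * 3600, "batch")
    print(resolve_feature(FeatureValue(1.0, 30.0, "stream"), nightly).source)
    print(resolve_feature(FeatureValue(1.0, 900.0, "stream"), nightly).source)
    print(resolve_feature(None, None).source)
```

Tagging the source of each resolved value is one way to feed the freshness monitoring the task asks for, and it also lets the explanation layer avoid citing a "recent" signal that actually came from the nightly job.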