Product Context
Sparksoft SecureEdge protects enterprise customers across Sparksoft Workspace, Sparksoft Identity, and Sparksoft Cloud Gateway. Design an end-to-end ML system that scores user and service activity in real time to detect risky sessions, credential abuse, and anomalous access patterns while meeting strict security and compliance requirements.
Scale
| Signal | Value |
|---|---|
| Enterprise end users monitored | 85M |
| Customer tenants | 120K |
| Peak event ingest | 900K events/sec |
| Real-time scoring QPS | 180K requests/sec |
| Daily security events | 22B |
| Distinct entities | 85M users, 14M devices, 9M service accounts |
| p99 decision latency budget | 120ms |
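A quick back-of-envelope check shows the scale figures above are mutually consistent; the derived ratios (peak-to-average ingest, events per scored request) are computed here, not stated in the table.

```python
# Sanity-check the scale figures from the table above.
SECONDS_PER_DAY = 86_400

daily_events = 22e9      # daily security events
peak_ingest = 900e3      # peak events/sec
scoring_qps = 180e3      # real-time scoring requests/sec

avg_ingest = daily_events / SECONDS_PER_DAY   # ~255K events/sec on average
peak_to_avg = peak_ingest / avg_ingest        # ~3.5x peak-to-average ratio

# Many raw events aggregate into one scoring decision.
events_per_decision = peak_ingest / scoring_qps   # 5 events per request at peak

print(f"avg ingest: {avg_ingest:,.0f} ev/s, "
      f"peak/avg: {peak_to_avg:.1f}x, "
      f"events per scored request: {events_per_decision:.0f}")
```

The roughly 3.5x peak-to-average ratio matters for capacity planning: ingestion and feature pipelines must be provisioned for 900K events/sec, not the daily average.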
Task
- Clarify the threat-detection use cases, decision types, and security requirements for Sparksoft SecureEdge.
- Design the full ML architecture, including event ingestion, candidate generation, ranking, and final policy decisioning.
- Choose models for each stage and explain tradeoffs across accuracy, latency, interpretability, and adversarial robustness.
- Define the offline and online data pipelines, including labels, delayed feedback, and feature consistency between training and serving.
- Explain how you would secure the ML system itself: access control, feature/data protection, model abuse prevention, auditability, and safe fallback behavior.
- Describe evaluation, monitoring, and top failure modes, especially feature drift, training-serving skew, and attacker adaptation.
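The staged decision path the tasks above ask for (cheap candidate filters, then an ML ranker, then policy decisioning with reason codes) can be sketched end to end. Everything here is illustrative: the `RiskEvent` fields, the toy heuristics standing in for the ranker, and the score thresholds are assumptions, not a prescribed API.

```python
from dataclasses import dataclass, field

@dataclass
class RiskEvent:
    user_id: str
    failed_logins: int
    new_device: bool
    geo_velocity_kmh: float          # implied travel speed between logins
    reasons: list = field(default_factory=list)

def candidate_filter(ev: RiskEvent) -> bool:
    """Stage 1: cheap rules decide whether the event needs ML scoring at all."""
    return ev.failed_logins >= 3 or ev.new_device or ev.geo_velocity_kmh > 900

def rank(ev: RiskEvent) -> float:
    """Stage 2: stand-in for the ML ranker; emits a risk score in [0, 1]
    and appends the reason codes analysts need for the audit trail."""
    score = 0.0
    if ev.failed_logins >= 3:
        score += 0.4
        ev.reasons.append("repeated_failed_logins")
    if ev.new_device:
        score += 0.3
        ev.reasons.append("unrecognized_device")
    if ev.geo_velocity_kmh > 900:
        score += 0.4
        ev.reasons.append("impossible_travel")
    return min(score, 1.0)

def decide(ev: RiskEvent) -> str:
    """Stage 3: policy maps the score to an action."""
    if not candidate_filter(ev):
        return "allow"
    score = rank(ev)
    if score >= 0.7:
        return "block_session"
    if score >= 0.4:
        return "step_up_auth"
    return "allow"

ev = RiskEvent("u1", failed_logins=4, new_device=False, geo_velocity_kmh=50.0)
print(decide(ev), ev.reasons)   # one medium signal fired -> step-up, with reason code
```

Keeping the candidate filter rule-based and deterministic is also what makes the degraded-mode constraint below tractable: stage 1 keeps working even if stage 2 is down.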
Constraints
- High-severity actions may trigger step-up auth or session block, so false positives are costly.
- Raw PII and customer secrets cannot be exposed to downstream model consumers; data minimization is required.
- Some customers require regional data residency and 30-day maximum retention for raw logs.
- Analysts need reason codes and audit trails for every high-risk decision.
- The system must continue operating in degraded mode if the ML ranker is unavailable.
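The last constraint, continuing in degraded mode when the ML ranker is unavailable, is commonly handled with a timeout-guarded call that falls back to deterministic rules and records an auditable reason for which path was taken. The timeout value, model/rule version strings, and feature names here are illustrative assumptions.

```python
import time

RANKER_TIMEOUT_S = 0.05   # leave headroom inside the 120ms p99 decision budget

def rule_fallback(features: dict) -> tuple:
    """Conservative deterministic rules used when the ranker is unavailable."""
    score = 0.8 if features.get("impossible_travel") else 0.2
    return score, "fallback_rules_v1"

def score_with_fallback(features: dict, ranker_call) -> tuple:
    """Try the ML ranker; on error or budget overrun, degrade to rules.
    The returned tag goes into the audit trail so analysts can see
    which scoring path produced each decision."""
    start = time.monotonic()
    try:
        score = ranker_call(features)
        if time.monotonic() - start > RANKER_TIMEOUT_S:
            raise TimeoutError("ranker exceeded latency budget")
        return score, "ml_ranker_v7"
    except Exception:
        return rule_fallback(features)

def broken_ranker(features):
    raise ConnectionError("ranker unreachable")

print(score_with_fallback({"impossible_travel": True}, broken_ranker))
```

Biasing the fallback rules toward fewer high-severity actions keeps the false-positive cost of degraded mode bounded, at the price of reduced recall until the ranker recovers.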