Product Context
Sparksoft SecureEdge protects enterprise customers across Sparksoft Workspace, Sparksoft Identity, and Sparksoft Cloud Gateway. Design an end-to-end ML system that scores user and service activity in real time to detect risky sessions, credential abuse, and anomalous access patterns while meeting strict security and compliance requirements.
Scale
| Signal | Value |
|---|---|
| Enterprise end users monitored | 85M |
| Customer tenants | 120K |
| Peak event ingest | 900K events/sec |
| Real-time scoring QPS | 180K requests/sec |
| Daily security events | 22B |
| Distinct entities | 85M users, 14M devices, 9M service accounts |
| p99 decision latency budget | 120ms |
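A quick back-of-envelope check shows the scale figures above are mutually consistent; the derived ratios (peak-to-average ingest, events per scored request) are computed here, not stated in the table.

```python
# Sanity-check the scale figures from the table above.
SECONDS_PER_DAY = 86_400

daily_events = 22e9      # daily security events
peak_ingest = 900e3      # peak events/sec
scoring_qps = 180e3      # real-time scoring requests/sec

avg_ingest = daily_events / SECONDS_PER_DAY   # ~255K events/sec on average
peak_to_avg = peak_ingest / avg_ingest        # ~3.5x peak-to-average ratio

# Many raw events aggregate into one scoring decision.
events_per_decision = peak_ingest / scoring_qps   # 5 events per request at peak

print(f"avg ingest: {avg_ingest:,.0f} ev/s, "
      f"peak/avg: {peak_to_avg:.1f}x, "
      f"events per scored request: {events_per_decision:.0f}")
```

The roughly 3.5x peak-to-average ratio matters for capacity planning: ingestion and feature pipelines must be provisioned for 900K events/sec, not the daily average.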
Task
- Clarify the threat-detection use cases, decision types, and security requirements for Sparksoft SecureEdge.
- Design the full ML architecture, including event ingestion, candidate generation, ranking, and final policy decisioning.
- Choose models for each stage and explain tradeoffs across accuracy, latency, interpretability, and adversarial robustness.
- Define the offline and online data pipelines, including labels, delayed feedback, and feature consistency between training and serving.
- Explain how you would secure the ML system itself: access control, feature/data protection, model abuse prevention, auditability, and safe fallback behavior.
- Describe evaluation, monitoring, and top failure modes, especially feature drift, training-serving skew, and attacker adaptation.
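The staged decision path the tasks above ask for (cheap candidate filters, then an ML ranker, then policy decisioning with reason codes) can be sketched end to end. Everything here is illustrative: the `RiskEvent` fields, the toy heuristics standing in for the ranker, and the score thresholds are assumptions, not a prescribed API.

```python
from dataclasses import dataclass, field

@dataclass
class RiskEvent:
    user_id: str
    failed_logins: int
    new_device: bool
    geo_velocity_kmh: float          # implied travel speed between logins
    reasons: list = field(default_factory=list)

def candidate_filter(ev: RiskEvent) -> bool:
    """Stage 1: cheap rules decide whether the event needs ML scoring at all."""
    return ev.failed_logins >= 3 or ev.new_device or ev.geo_velocity_kmh > 900

def rank(ev: RiskEvent) -> float:
    """Stage 2: stand-in for the ML ranker; emits a risk score in [0, 1]
    and appends the reason codes analysts need for the audit trail."""
    score = 0.0
    if ev.failed_logins >= 3:
        score += 0.4
        ev.reasons.append("repeated_failed_logins")
    if ev.new_device:
        score += 0.3
        ev.reasons.append("unrecognized_device")
    if ev.geo_velocity_kmh > 900:
        score += 0.4
        ev.reasons.append("impossible_travel")
    return min(score, 1.0)

def decide(ev: RiskEvent) -> str:
    """Stage 3: policy maps the score to an action."""
    if not candidate_filter(ev):
        return "allow"
    score = rank(ev)
    if score >= 0.7:
        return "block_session"
    if score >= 0.4:
        return "step_up_auth"
    return "allow"

ev = RiskEvent("u1", failed_logins=4, new_device=False, geo_velocity_kmh=50.0)
print(decide(ev), ev.reasons)   # one medium signal fired -> step-up, with reason code
```

Keeping the candidate filter rule-based and deterministic is also what makes the degraded-mode constraint below tractable: stage 1 keeps working even if stage 2 is down.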
Constraints
- High-severity actions may trigger step-up auth or session block, so false positives are costly.
- Raw PII and customer secrets cannot be exposed to downstream model consumers; data minimization is required.
- Some customers require regional data residency and 30-day maximum retention for raw logs.
- Analysts need reason codes and audit trails for every high-risk decision.
- The system must continue operating in degraded mode if the ML ranker is unavailable.
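The last constraint, continuing in degraded mode when the ML ranker is unavailable, is commonly handled with a timeout-guarded call that falls back to deterministic rules and records an auditable reason for which path was taken. The timeout value, model/rule version strings, and feature names here are illustrative assumptions.

```python
import time

RANKER_TIMEOUT_S = 0.05   # leave headroom inside the 120ms p99 decision budget

def rule_fallback(features: dict) -> tuple:
    """Conservative deterministic rules used when the ranker is unavailable."""
    score = 0.8 if features.get("impossible_travel") else 0.2
    return score, "fallback_rules_v1"

def score_with_fallback(features: dict, ranker_call) -> tuple:
    """Try the ML ranker; on error or budget overrun, degrade to rules.
    The returned tag goes into the audit trail so analysts can see
    which scoring path produced each decision."""
    start = time.monotonic()
    try:
        score = ranker_call(features)
        if time.monotonic() - start > RANKER_TIMEOUT_S:
            raise TimeoutError("ranker exceeded latency budget")
        return score, "ml_ranker_v7"
    except Exception:
        return rule_fallback(features)

def broken_ranker(features):
    raise ConnectionError("ranker unreachable")

print(score_with_fallback({"impossible_travel": True}, broken_ranker))
```

Biasing the fallback rules toward fewer high-severity actions keeps the false-positive cost of degraded mode bounded, at the price of reduced recall until the ranker recovers.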