Classify and Cluster Vectra Detections

Business Context

Vectra AI wants to improve triage in the Vectra AI Platform by both classifying known attack behaviors and surfacing novel patterns in detections that do not yet have reliable labels. You need to show the practical difference between supervised and unsupervised learning using the same security dataset.

Dataset

You are given historical detection-level telemetry exported from the Vectra AI Platform.

Feature Group	Count	Examples
Detection scores	6	certainty_score, threat_score, triage_priority, host_risk_score
Entity behavior	10	failed_logins_24h, lateral_movement_events_7d, beaconing_count_24h, rare_process_count
Asset context	7	device_type, identity_type, business_unit, crown_jewel_flag
Network context	8	bytes_outbound_1h, unique_dst_ips_24h, external_conn_ratio, protocol_entropy
Temporal features	5	hour_of_day, day_of_week, time_since_first_seen, burstiness_index

Rows: 240K detections collected over 9 months
Target available for subset: analyst_validated_label = malicious (1) / benign (0)
Labeled subset: 72K rows; remaining 168K rows are unlabeled
Class balance in labeled data: 11% malicious, 89% benign
Missing data: 8% missing in asset context, 3% missing in network features, higher missingness for newly observed hosts
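
The figures above imply a few counts worth having in hand before modeling. The sketch below only restates the arithmetic from the dataset description; the function and key names are illustrative, and using the benign-to-malicious ratio as a starting class weight is an assumption, not something the dataset description prescribes.

```python
# Figures taken from the dataset description above.
TOTAL_ROWS = 240_000
LABELED_ROWS = 72_000
MALICIOUS_RATE = 0.11  # share of labeled rows marked malicious


def dataset_summary(total=TOTAL_ROWS, labeled=LABELED_ROWS, pos_rate=MALICIOUS_RATE):
    """Return basic counts implied by the stated dataset figures."""
    unlabeled = total - labeled
    malicious = round(labeled * pos_rate)
    benign = labeled - malicious
    return {
        "unlabeled_rows": unlabeled,
        "labeled_fraction": labeled / total,
        "malicious_rows": malicious,
        "benign_rows": benign,
        # Benign-to-malicious ratio: one possible starting point for a
        # positive-class weight (an assumption, tune on validation data).
        "benign_to_malicious": benign / malicious,
    }


summary = dataset_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```

With roughly 7.9K malicious rows in the labeled subset, a stratified split is likely needed to keep enough positives in the test set.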

Success Criteria

A strong solution should:

Achieve AUC-ROC >= 0.88 and recall >= 0.75 at precision >= 0.50 on the labeled test set for the supervised model
Produce unsupervised clusters with silhouette score >= 0.20 and a clear interpretation of at least 3 cluster types
Clearly explain when Vectra AI should use supervised classification vs unsupervised clustering in production
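
The "recall at precision >= 0.50" criterion can be read as: among all score thresholds whose precision is at least 0.50, report the best recall. The sketch below is one minimal way to compute that in plain Python. The function name and the toy labels and scores are made up for illustration and are not drawn from the Vectra dataset.

```python
def recall_at_precision(y_true, scores, min_precision=0.50):
    """Highest recall over all score thresholds with precision >= min_precision.

    Returns (recall, threshold), or (0.0, None) if no threshold qualifies.
    A detection is predicted malicious when its score >= threshold.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank detections from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    best_recall, best_threshold = 0.0, None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every row tied at this score so the threshold is well defined.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, threshold
    return best_recall, best_threshold


# Toy data, made up for illustration only.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
recall, threshold = recall_at_precision(labels, scores)
print(recall, threshold)
```

On the real labeled test set, the same sweep would be compared against the recall >= 0.75 target; AUC-ROC and silhouette score need separate computations.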

Constraints

Batch scoring must complete within 15 minutes for 300K daily detections
Security analysts need interpretable outputs, not just raw scores
Labels are incomplete and may lag by days or weeks
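
The batch-scoring constraint translates into a per-row time budget. The sketch below works through that arithmetic. It assumes a single sequential worker and that the 15 minutes covers the whole pipeline (feature preparation plus both models), which the constraint does not spell out.

```python
def scoring_budget(rows=300_000, window_minutes=15):
    """Convert a batch window into required throughput and a per-row budget."""
    window_seconds = window_minutes * 60
    return {
        "rows_per_second": rows / window_seconds,
        "ms_per_row": 1000 * window_seconds / rows,
    }


budget = scoring_budget()
print(f"required throughput: {budget['rows_per_second']:.1f} rows/s")
print(f"per-row budget:      {budget['ms_per_row']:.1f} ms")
```

A budget of about 3 ms per row on one worker is generous for vectorized batch inference, but it can tighten quickly if per-row explanations are generated for analysts.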

Deliverables

Train a supervised model to classify malicious vs benign detections
Train an unsupervised model to group detections without labels
Compare the two approaches, including data requirements, outputs, and failure modes
Recommend how both models should be used together in the Vectra AI Platform
Report evaluation metrics and key feature or cluster insights
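
One possible way to frame the "used together" deliverable is a routing rule: the classifier handles behaviors it has labels for, and distance to known cluster centroids flags detections that look unlike anything seen so far. The sketch below is a hypothetical illustration, not a Vectra AI Platform API. The Detection fields, both thresholds, and the route names are assumptions chosen for the example.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Detection:
    malicious_score: float  # hypothetical classifier output in [0, 1]
    features: tuple         # hypothetical scaled numeric feature vector


def nearest_centroid_distance(features, centroids):
    """Euclidean distance from a feature vector to its closest centroid."""
    return min(dist(features, centroid) for centroid in centroids)


def route(detection, centroids, score_threshold=0.5, novelty_distance=2.0):
    """Pick a triage route; both thresholds are placeholders to be tuned."""
    if detection.malicious_score >= score_threshold:
        return "escalate"          # supervised model recognizes a known pattern
    if nearest_centroid_distance(detection.features, centroids) > novelty_distance:
        return "review_as_novel"   # far from every known cluster
    return "low_priority"


# Toy centroids and detections, made up for illustration only.
centroids = [(0.0, 0.0), (5.0, 5.0)]
examples = [
    Detection(0.9, (0.1, 0.1)),
    Detection(0.1, (0.2, 0.1)),
    Detection(0.1, (10.0, -3.0)),
]
routes = [route(d, centroids) for d in examples]
print(routes)
```

Detections routed for review are also natural candidates for analyst labeling, which would feed back into the supervised training set as labels arrive.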
