Detect Malicious Bots in WAF Traffic

Business Context

ShieldGate operates a cloud Web Application Firewall (WAF) protecting 12,000 customer web applications and processing roughly 180 million HTTP requests per day. The security team wants a machine learning classifier that distinguishes legitimate human traffic from malicious bots so the WAF can block or challenge suspicious requests without hurting real users.

Dataset

You are given request-level logs collected over 30 days of WAF traffic. Each row represents one HTTP request, with engineered session and network context that is available at scoring time.

Feature Group	Count	Examples
Request metadata	12	method, path_depth, query_length, content_type, response_status
Header and client signals	10	user_agent_family, accept_language_present, cookie_count, header_count
Behavioral features	11	requests_per_minute_ip, inter_request_time_ms, repeated_path_ratio, session_duration_sec
Network and reputation	8	asn, country, ip_reputation_score, proxy_flag, datacenter_ip_flag
Browser consistency	6	ua_os_browser_match, js_challenge_passed, tls_fingerprint_rarity

Size: 4.8M requests, 47 features
Target: Binary label — malicious bot (1) vs legitimate user/request (0)
Class balance: 7.4% malicious, 92.6% legitimate
Missing data: 18% missing in JavaScript/browser-derived fields, 6% missing in reputation fields for newly seen IPs
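
To make the imbalance concrete, the summary figures above can be converted into row counts. This is a small illustrative sketch: the variable names are not part of the dataset, and the missing-field counts assume each percentage applies to all 4.8M rows (the summary does not say whether the two groups overlap).

```python
# Row counts implied by the dataset summary (names are illustrative).
TOTAL_REQUESTS = 4_800_000
MALICIOUS_RATE = 0.074      # 7.4% malicious
JS_MISSING_RATE = 0.18      # JavaScript/browser-derived fields missing
REP_MISSING_RATE = 0.06     # reputation fields missing (newly seen IPs)

malicious = round(TOTAL_REQUESTS * MALICIOUS_RATE)
legitimate = TOTAL_REQUESTS - malicious

# Negative-to-positive ratio, a common starting point for class weights.
imbalance_ratio = legitimate / malicious

js_missing_rows = round(TOTAL_REQUESTS * JS_MISSING_RATE)
rep_missing_rows = round(TOTAL_REQUESTS * REP_MISSING_RATE)

print(f"malicious={malicious:,} legitimate={legitimate:,}")
print(f"imbalance ratio ~ {imbalance_ratio:.1f}:1")
print(f"missing JS fields={js_missing_rows:,} missing reputation={rep_missing_rows:,}")
```

With about 355,000 positives against roughly 4.4 million negatives, plain accuracy would be dominated by the legitimate class, which is why the success criteria are stated in terms of recall and precision.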

Success Criteria

A good solution should achieve strong bot recall while keeping false positives low enough for production use. Target at least 90% recall on malicious bots with precision >= 70% on the blocked/challenged class, and support threshold tuning for different customer risk profiles.
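
It can help to translate the recall and precision targets into a false-positive budget. The sketch below assumes the model hits both targets exactly on a dataset with the stated size and class balance; the function name `false_positive_budget` is hypothetical.

```python
def false_positive_budget(n_total, positive_rate, recall, precision):
    """Return (true_positives, false_positives, false_positive_rate)
    implied by meeting the given recall and precision exactly."""
    positives = n_total * positive_rate
    negatives = n_total - positives
    tp = recall * positives
    flagged = tp / precision      # precision = tp / flagged
    fp = flagged - tp
    return tp, fp, fp / negatives


tp, fp, fpr = false_positive_budget(
    n_total=4_800_000, positive_rate=0.074, recall=0.90, precision=0.70
)
print(f"TP={tp:,.0f} FP={fp:,.0f} FPR={fpr:.2%}")
```

Under these assumptions the targets tolerate a false-positive rate of roughly 3.1% of legitimate requests. If production traffic had the same class balance (which is not guaranteed), that would correspond to about 5.1 million legitimate requests blocked or challenged per day at 180 million requests per day, so the split between block and challenge matters.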

Constraints

Inference latency must stay under 15 ms per request at the model layer.
The model must be explainable enough for security analysts to review top risk factors.
Features must be available in real time; no future session information or post-request labels may be used.
The model should be retrained weekly because bot behavior changes quickly.
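
One way to respect the real-time and weekly-retraining constraints during evaluation is to split the data by time rather than at random, so the model is always validated on traffic later than anything it trained on. The following is a minimal sketch; the `request_date` field and the 7-day holdout are assumptions made for illustration, not columns or settings named in the dataset description.

```python
from datetime import date, timedelta


def time_ordered_split(rows, holdout_days=7, date_key="request_date"):
    """Split rows so every validation row is strictly later than every
    training row. `date_key` is a hypothetical column name."""
    last_day = max(r[date_key] for r in rows)
    cutoff = last_day - timedelta(days=holdout_days)
    train = [r for r in rows if r[date_key] <= cutoff]
    valid = [r for r in rows if r[date_key] > cutoff]
    return train, valid


# Tiny synthetic example: one row per day across a 30-day window.
start = date(2024, 1, 1)
rows = [{"request_date": start + timedelta(days=i), "label": i % 2}
        for i in range(30)]

train, valid = time_ordered_split(rows)
print(len(train), len(valid))
```

A random split would let the model see bot campaigns from the same days it is tested on, which tends to overstate how well it will do on next week's traffic.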

Deliverables

Build a binary classification pipeline for bot detection.
Explain which features help separate legitimate users from malicious bots.
Choose an evaluation strategy appropriate for imbalanced security data.
Propose decision thresholds for the block, challenge, and allow actions.
Describe how you would monitor drift and model degradation after deployment.
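
Because there are three actions, the block/challenge/allow decision needs two cut-offs on the model score, and the customer risk profiles mentioned in the success criteria can be expressed as different pairs of cut-offs. The sketch below is one possible shape for that mapping; the `RiskProfile` class, the profile names, and all threshold values are hypothetical and would need to be chosen from validation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskProfile:
    """Two cut-offs on the model's bot probability (hypothetical values)."""
    challenge_at: float
    block_at: float

    def __post_init__(self):
        if not 0.0 <= self.challenge_at <= self.block_at <= 1.0:
            raise ValueError("need 0 <= challenge_at <= block_at <= 1")


def action_for(score: float, profile: RiskProfile) -> str:
    """Map a bot probability to 'block', 'challenge', or 'allow'."""
    if score >= profile.block_at:
        return "block"
    if score >= profile.challenge_at:
        return "challenge"
    return "allow"


# Illustrative profiles only; real values come from precision/recall analysis.
PROFILES = {
    "strict": RiskProfile(challenge_at=0.30, block_at=0.80),
    "lenient": RiskProfile(challenge_at=0.60, block_at=0.95),
}

for name, profile in PROFILES.items():
    print(name, [action_for(s, profile) for s in (0.10, 0.50, 0.90)])
```

Keeping the thresholds outside the model means a customer's risk profile can be adjusted without retraining, and the challenge band gives uncertain requests a lower-cost path than an outright block.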
