Context
LogPulse, a Datadog-style observability startup, wants an LLM-powered layer in its incident pipeline to help SREs detect and summarize anomalies in application and infrastructure logs. The goal is not to replace statistical detectors but to improve triage quality by extracting structured anomaly signals, grouping related events, and generating grounded incident summaries.
Constraints
- p95 end-to-end latency: < 2,500ms per alert batch
- Cost ceiling: < $0.015 per analyzed batch and < $18K/month at 40M log lines/day
- Hallucination ceiling: < 2% unsupported root-cause claims on a labeled eval set
- False-negative rate on critical incidents must not worsen by more than 1 percentage point versus the current rules-based baseline
- Logs may contain PII, secrets, stack traces, and adversarial text; the system must not follow instructions embedded in logs
- Output must be machine-readable JSON for downstream alerting and case management
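Taken together, the cost ceilings above imply a minimum effective batch size. A quick back-of-envelope check (assuming ~30 days/month and, pessimistically, that every log line passes through the LLM layer):

```python
# Back-of-envelope: what batch size do the cost constraints imply?
lines_per_day = 40_000_000       # from the constraints
monthly_budget = 18_000          # USD, from the constraints
cost_per_batch = 0.015           # USD ceiling per analyzed batch

daily_budget = monthly_budget / 30                      # ~$600/day
max_batches_per_day = daily_budget / cost_per_batch     # ~40,000 batches
min_lines_per_batch = lines_per_day / max_batches_per_day  # ~1,000 lines

print(f"max batches/day: {max_batches_per_day:,.0f}")
print(f"min avg lines/batch if every line is analyzed: {min_lines_per_batch:,.0f}")
```

In practice the rules and statistical detectors would gate which lines ever reach the LLM, so effective batch sizes can be far smaller than this worst case.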
Available Resources
- Streaming logs from 2,000 services with metadata: timestamp, service, severity, host, region, deployment version
- Existing anomaly signals from rules and time-series detectors (spike, error-rate jump, rare template, latency regression)
- 6 months of historical incidents with analyst-written postmortems and alert labels
- Access to approved small and medium LLMs via the OpenAI API, plus a feature store / vector index for historical-incident retrieval
- Security-approved redaction service for secrets and common PII patterns
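The redaction service's interface is not specified here; purely as an illustration, a pre-LLM redaction pass might look like the sketch below. The patterns and the `redact` helper are hypothetical stand-ins for the security-approved service, whose real coverage of secrets and PII is far broader than two regexes.

```python
import re

# Hypothetical stand-in for the security-approved redaction service.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "AWS_KEY": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def redact(line: str) -> str:
    """Replace matched spans with typed placeholders before the line
    is ever included in an LLM prompt."""
    for label, pattern in PATTERNS.items():
        line = pattern.sub(f"[REDACTED_{label}]", line)
    return line

print(redact("auth failure for alice@example.com key=AKIAABCDEFGHIJKLMNOP"))
```

Redacting before prompt assembly (rather than after generation) also limits what an injected instruction inside a log line can exfiltrate.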
Task
- Design an LLM integration for the anomaly-detection pipeline, including where the LLM is used versus where deterministic/statistical methods remain primary.
- Specify the prompt and structured output schema for classifying anomaly type, severity, likely blast radius, confidence, and recommended escalation.
- Define an evaluation plan first: offline datasets, hallucination and prompt-injection tests, and online guardrails before rollout.
- Propose a production architecture covering batching, retrieval of similar historical incidents, latency/cost controls, and fallback behavior when the LLM is unavailable.
- Identify the main risks you would watch for in production and how you would detect and mitigate them.
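For the structured output asked for in the second task bullet, the exact schema is left to the candidate. One possible sketch of a machine-readable classification record follows; the field names, enums, and confidence scale are illustrative assumptions, not a prescribed format.

```python
import json

# Illustrative output record for one analyzed anomaly batch; field names,
# enums, and the confidence scale here are assumptions, not a spec.
record = {
    "anomaly_type": "error_rate_jump",   # e.g. spike | rare_template | latency_regression
    "severity": "high",                  # low | medium | high | critical
    "blast_radius": {
        "services": ["checkout", "payments"],
        "regions": ["us-east-1"],
    },
    "confidence": 0.82,                  # 0.0-1.0, calibrated offline
    "escalation": "page_oncall",         # log_only | ticket | page_oncall
    "evidence": ["log template id 4417", "deploy v2.31 at 14:02Z"],
}

# Downstream alerting consumes JSON, so the record must round-trip cleanly.
serialized = json.dumps(record)
assert json.loads(serialized) == record
print(serialized[:60] + "...")
```

Requiring an `evidence` field that cites concrete log templates or deploy events is one way to make the hallucination ceiling testable: any root-cause claim without a supporting evidence entry can be flagged automatically.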