Context
LogPulse, a Datadog-style observability startup, wants an LLM-powered layer in its incident pipeline to help SREs detect and summarize anomalies in application and infrastructure logs. The goal is not to replace statistical detectors but to improve triage quality by extracting structured anomaly signals, grouping related events, and generating grounded incident summaries.
Constraints
- p95 end-to-end latency: < 2,500ms per alert batch
- Cost ceiling: < $0.015 per analyzed batch and < $18K/month at 40M log lines/day
- Hallucination ceiling: < 2% unsupported root-cause claims on a labeled eval set
- False-negative rate on critical incidents must not worsen by more than 1 percentage point versus the current rules-based baseline
- Logs may contain PII, secrets, stack traces, and adversarial text; the system must not follow instructions embedded in logs
- Output must be machine-readable JSON for downstream alerting and case management
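Taken together, the cost ceilings above imply a minimum effective batch size. A quick back-of-envelope check (assuming ~30 days/month and, pessimistically, that every log line passes through the LLM layer):

```python
# Back-of-envelope: what batch size do the cost constraints imply?
lines_per_day = 40_000_000       # from the constraints
monthly_budget = 18_000          # USD, from the constraints
cost_per_batch = 0.015           # USD ceiling per analyzed batch

daily_budget = monthly_budget / 30                      # ~$600/day
max_batches_per_day = daily_budget / cost_per_batch     # ~40,000 batches
min_lines_per_batch = lines_per_day / max_batches_per_day  # ~1,000 lines

print(f"max batches/day: {max_batches_per_day:,.0f}")
print(f"min avg lines/batch if every line is analyzed: {min_lines_per_batch:,.0f}")
```

In practice the rules and statistical detectors would gate which lines ever reach the LLM, so effective batch sizes can be far smaller than this worst case.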
Available Resources
- Streaming logs from 2,000 services with metadata: timestamp, service, severity, host, region, deployment version
- Existing anomaly signals from rules and time-series detectors (spike, error-rate jump, rare template, latency regression)
- 6 months of historical incidents with analyst-written postmortems and alert labels
- Access to approved small and medium LLMs via the OpenAI API, plus a feature store / vector index for historical-incident retrieval
- Security-approved redaction service for secrets and common PII patterns
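The redaction service's interface is not specified here; purely as an illustration, a pre-LLM redaction pass might look like the sketch below. The patterns and the `redact` helper are hypothetical stand-ins for the security-approved service, whose real coverage of secrets and PII is far broader than two regexes.

```python
import re

# Hypothetical stand-in for the security-approved redaction service.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "AWS_KEY": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def redact(line: str) -> str:
    """Replace matched spans with typed placeholders before the line
    is ever included in an LLM prompt."""
    for label, pattern in PATTERNS.items():
        line = pattern.sub(f"[REDACTED_{label}]", line)
    return line

print(redact("auth failure for alice@example.com key=AKIAABCDEFGHIJKLMNOP"))
```

Redacting before prompt assembly (rather than after generation) also limits what an injected instruction inside a log line can exfiltrate.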
Task
- Design an LLM integration for the anomaly-detection pipeline, including where the LLM is used versus where deterministic/statistical methods remain primary.
- Specify the prompt and structured output schema for classifying anomaly type, severity, likely blast radius, confidence, and recommended escalation.
- Define an evaluation plan first: offline datasets, hallucination and prompt-injection tests, and online guardrails before rollout.
- Propose a production architecture covering batching, retrieval of similar historical incidents, latency/cost controls, and fallback behavior when the LLM is unavailable.
- Identify the main risks you would watch for in production and how you would detect and mitigate them.
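For the structured output asked for in the second task bullet, the exact schema is left to the candidate. One possible sketch of a machine-readable classification record follows; the field names, enums, and confidence scale are illustrative assumptions, not a prescribed format.

```python
import json

# Illustrative output record for one analyzed anomaly batch; field names,
# enums, and the confidence scale here are assumptions, not a spec.
record = {
    "anomaly_type": "error_rate_jump",   # e.g. spike | rare_template | latency_regression
    "severity": "high",                  # low | medium | high | critical
    "blast_radius": {
        "services": ["checkout", "payments"],
        "regions": ["us-east-1"],
    },
    "confidence": 0.82,                  # 0.0-1.0, calibrated offline
    "escalation": "page_oncall",         # log_only | ticket | page_oncall
    "evidence": ["log template id 4417", "deploy v2.31 at 14:02Z"],
}

# Downstream alerting consumes JSON, so the record must round-trip cleanly.
serialized = json.dumps(record)
assert json.loads(serialized) == record
print(serialized[:60] + "...")
```

Requiring an `evidence` field that cites concrete log templates or deploy events is one way to make the hallucination ceiling testable: any root-cause claim without a supporting evidence entry can be flagged automatically.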