Business Context
Apex Dynamics runs large-scale engineering simulations and uses an internal LLM agent to answer questions about failures, anomalies, and run outcomes. Individual simulation logs are often much longer than the model's context window, so the team needs an NLP pipeline that preserves critical evidence while enabling accurate downstream question answering.
Data
- Volume: ~250,000 simulation runs per month, each with 5-50 log files
- Text length: 20K-2M tokens per run after concatenation; median run size is ~180K tokens
- Language: Primarily English with structured timestamps, error codes, stack traces, metrics, and subsystem names
- Label distribution: Historical annotations exist for 8 incident types, but only ~18% of runs have analyst-written summaries or root-cause notes
- Noise: Repeated heartbeat messages, duplicated warnings, and verbose debug traces account for ~40% of tokens
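Since heartbeats, duplicated warnings, and verbose traces account for ~40% of tokens, a cheap first pass can collapse near-duplicate lines before any model sees the log. A minimal sketch, assuming timestamps, hex ids, and bare counters are the only volatile fields; the regexes and the `dedup_log` helper are illustrative, not a prescribed design:

```python
import hashlib
import re

# Volatile fields to mask before comparing lines (assumed formats).
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?")
HEX_ID = re.compile(r"0x[0-9a-fA-F]+")
NUMBER = re.compile(r"\b\d+\b")

def normalize(line: str) -> str:
    """Replace volatile fields so repeated messages hash identically."""
    line = TIMESTAMP.sub("<ts>", line)
    line = HEX_ID.sub("<hex>", line)
    return NUMBER.sub("<n>", line)

def dedup_log(lines):
    """Keep the first occurrence of each normalized line plus a repeat count."""
    seen = {}   # normalized-hash -> index into `kept`
    kept = []   # list of (original_line, repeat_count)
    for line in lines:
        key = hashlib.sha1(normalize(line).encode()).hexdigest()
        if key in seen:
            text, count = kept[seen[key]]
            kept[seen[key]] = (text, count + 1)
        else:
            seen[key] = len(kept)
            kept.append((line, 1))
    return kept
```

Keeping a repeat count (rather than silently dropping lines) preserves evidence such as "heartbeat missing for 40s" patterns that an analyst may later ask about.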
Success Criteria
A good solution should retain the events needed to answer analyst questions, achieve strong retrieval quality on root-cause evidence, and produce concise summaries that improve agent answer accuracy. Targets: p95 retrieval latency under 800ms per query, and summary generation under 3s per run-segment batch.
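The latency targets above translate directly into an acceptance gate over recorded per-query and per-batch timings. A minimal sketch using the nearest-rank p95; the thresholds come from the stated criteria, while the function names are illustrative:

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    ranked = sorted(latencies_ms)
    idx = math.ceil(0.95 * len(ranked)) - 1  # 1-indexed rank -> 0-indexed
    return ranked[idx]

def meets_targets(retrieval_ms, summary_ms):
    """True iff both stated p95 targets (800ms retrieval, 3s summaries) hold."""
    return p95(retrieval_ms) < 800 and p95(summary_ms) < 3000
```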
Constraints
- Must run in a secure VPC; no external API calls
- GPU budget is limited to a single A10 for batch summarization jobs
- The final agent prompt must stay within an 8K-token context window
- The pipeline should support both offline batch processing and near-real-time incident investigation
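The 8K-token prompt cap means retrieved evidence has to be packed against an explicit budget. A hedged sketch, assuming chunks arrive pre-ranked by relevance and carry a timestamp; the whitespace `count_tokens` is a crude stand-in for whatever tokenizer the deployed model uses, and the reserve size is an assumption:

```python
def count_tokens(text: str) -> int:
    # Placeholder: swap in the deployed model's real tokenizer.
    return len(text.split())

def pack_prompt(question, chunks, budget=8000, reserve=1024):
    """Greedily fit relevance-ranked chunks under the token budget,
    reserving room for instructions and the generated answer."""
    remaining = budget - reserve - count_tokens(question)
    picked = []
    for chunk in chunks:  # assumed sorted by descending relevance
        cost = count_tokens(chunk["text"])
        if cost <= remaining:
            picked.append(chunk)
            remaining -= cost
    # Re-sort the selected evidence into temporal order for the prompt.
    picked.sort(key=lambda c: c["ts"])
    return picked
```

Selecting by relevance but presenting in temporal order serves both the context-window constraint and the later requirement to preserve event ordering.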
Requirements
- Design a long-context handling pipeline for extensive simulation logs.
- Implement chunking, deduplication, retrieval, and hierarchical summarization.
- Build a model that classifies or ranks chunks by incident relevance before final summarization.
- Explain how you would preserve temporal order and cross-file dependencies.
- Define evaluation for retrieval quality, summary faithfulness, and downstream QA performance.
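For the chunking and temporal-order requirements above, one plausible shape is overlapping fixed-size windows tagged with file name and starting line, so retrieved chunks can be re-sorted into log order and traced back across files. The window and overlap sizes below are assumptions for illustration, not part of the spec:

```python
def chunk_file(name, lines, window=120, overlap=20):
    """Split a log file into overlapping line windows with provenance
    metadata (file name, starting line) for temporal reassembly."""
    chunks = []
    step = window - overlap
    for start in range(0, max(len(lines), 1), step):
        body = lines[start:start + window]
        if not body:
            break
        chunks.append({
            "file": name,              # cross-file provenance
            "start_line": start,       # enables re-sorting into log order
            "text": "\n".join(body),
        })
        if start + window >= len(lines):
            break
    return chunks
```

Chunks from multiple files can then be merged by `(timestamp, file, start_line)` before ranking, which keeps cross-file dependencies reconstructible downstream.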