Context
Cognition wants to expand how engineering managers use Devin to improve team workflows: triaging bugs, drafting small code changes, summarizing PRs, and answering questions about internal runbooks. Today, usage is ad hoc and hard to measure; leadership wants a practical design that improves team throughput without creating unsafe or low-trust automation.
Constraints
- p95 end-to-end latency: < 12 seconds for a single workflow request
- Cost ceiling: < $8 per engineer per month at 2,000 assisted tasks/day across the org
- Hallucination ceiling: < 2% on a labeled offline set for factual workflow answers
- Unsafe action rate (wrong repo, bad command, policy violation): < 0.5%
- Must be resilient to prompt injection from issue descriptions, code comments, and retrieved docs
- Human approval is required before any write action (opening PRs, editing configs, posting incident updates)
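Taken together, the cost ceiling and task volume imply a per-task budget worth computing up front. A quick sketch, assuming an org size of 200 engineers and a 30-day month (neither number is given in the brief):

```python
# Back-of-envelope per-task cost budget implied by the constraints above.
# ASSUMPTIONS (not in the brief): 200 engineers in the org, 30-day month.
ENGINEERS = 200
COST_CEILING_PER_ENGINEER = 8.00   # USD per engineer per month (constraint)
TASKS_PER_DAY = 2_000              # assisted tasks/day across the org (constraint)
DAYS_PER_MONTH = 30

monthly_budget = ENGINEERS * COST_CEILING_PER_ENGINEER   # $1,600/month
monthly_tasks = TASKS_PER_DAY * DAYS_PER_MONTH           # 60,000 tasks/month
per_task_budget = monthly_budget / monthly_tasks         # ~$0.027 per task

print(f"${per_task_budget:.3f} per task")
```

At roughly three cents per task, the budget rules out long multi-model chains on every request; retrieval plus a single grounded generation has to be the common path, with heavier planning reserved for a minority of tasks.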
Available Resources
- Devin with access to GitHub repos, PRs, issues, CI logs, and internal engineering docs
- 12 months of historical tickets, PR discussions, incident retros, and runbooks
- Approved LLM APIs (OpenAI or Anthropic), internal vector search, and basic telemetry
- 20 senior engineers available to label a golden set of successful vs failed workflow outcomes
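The golden set those engineers label can be a flat per-task record that directly supports the offline ceilings in the constraints. A minimal sketch with hypothetical field names (not an existing internal format):

```python
# Hypothetical golden-set record and offline metric computation.
# Field names are assumptions for illustration, not a real internal schema.
from dataclasses import dataclass

@dataclass
class GoldenExample:
    task_id: str
    task_type: str        # e.g. "triage", "pr_summary", "runbook_qa"
    outcome: str          # "success" or "failure", labeled by a senior engineer
    hallucinated: bool    # factual error in a workflow answer
    unsafe_action: bool   # wrong repo, bad command, or policy violation

def offline_metrics(examples: list[GoldenExample]) -> dict[str, float]:
    """Rates to compare against the <2% hallucination and <0.5% unsafe ceilings."""
    n = len(examples)
    return {
        "hallucination_rate": sum(e.hallucinated for e in examples) / n,
        "unsafe_action_rate": sum(e.unsafe_action for e in examples) / n,
    }
```

Keeping hallucination and unsafe-action labels separate matters: a fluent but wrong runbook answer and a well-grounded but mis-targeted command fail different ceilings and need different mitigations.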
Task
- Design an agentic workflow assistant around Devin for 2-3 high-value engineering tasks, including when it should retrieve docs, ask clarifying questions, or stop and escalate.
- Define the evaluation plan first: offline golden-set evaluation, safety/adversarial tests, and online success metrics after launch.
- Write a strong system prompt that constrains tool use, enforces grounded behavior, and refuses unsafe or unsupported actions.
- Propose an architecture covering retrieval, planning/tool use, approval gates, and observability.
- Estimate cost and latency, then explain the main tradeoffs and failure modes.
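The approval-gate requirement above can be enforced at a single choke point in front of every tool call, so that no write ever depends on the model classifying itself correctly. A minimal sketch, assuming a simple read/write action taxonomy (all names here are hypothetical, not a real Devin API):

```python
# Minimal approval-gate sketch: default-deny, human approval required for writes.
# Action names and the taxonomy itself are illustrative assumptions.
from dataclasses import dataclass

WRITE_ACTIONS = {"open_pr", "edit_config", "post_incident_update"}
READ_ACTIONS = {"read_issue", "search_docs", "summarize_pr", "fetch_ci_logs"}

@dataclass
class Decision:
    allowed: bool
    needs_human_approval: bool
    reason: str

def gate(action: str, approved_by_human: bool = False) -> Decision:
    if action in READ_ACTIONS:
        return Decision(True, False, "read-only action")
    if action in WRITE_ACTIONS:
        if approved_by_human:
            return Decision(True, True, "write action with human approval")
        return Decision(False, True, "write action blocked pending approval")
    # Unknown actions are refused outright rather than guessed at.
    return Decision(False, False, "unknown action refused")

decision = gate("open_pr")
print(decision.reason)  # write action blocked pending approval
```

A default-deny allowlist like this is also the cheapest prompt-injection defense: injected instructions in an issue or retrieved doc can at worst request an action the gate refuses or queues for a human, and every decision is a loggable event for the observability layer.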