Context
Asana is considering a broader rollout of Asana AI writing assistance in task creation, project updates, and status summaries. The feature can rewrite text, draft summaries from task context, and suggest next steps, but leadership wants to know whether it genuinely helps users complete work better or merely feels novel.
Constraints
- p95 end-to-end latency: < 1,500 ms for inline suggestions in Asana
- Cost ceiling: < $0.015 per assisted interaction and < $120K/month at projected scale
- Hallucination ceiling: < 2% of responses may introduce unsupported facts about task status, owners, deadlines, or dependencies
- Safety: must resist prompt injection from task descriptions/comments, avoid leaking private project data across workspaces, and refuse when context is insufficient
- UX constraint: no long multi-turn flow for common actions; most responses should be one-shot
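The numeric ceilings above can be read as a single pass/fail gate. The sketch below is one hedged way to encode them in Python; the function names, the nearest-rank percentile method, and the choice to treat the per-interaction cost as a mean are assumptions for illustration, not part of the brief. It also shows the arithmetic implied by the two cost ceilings: at exactly $0.015 per interaction, $120K/month covers about 8 million interactions.

```python
# Ceilings copied from the constraints above.
P95_LATENCY_MS = 1500
COST_PER_INTERACTION_USD = 0.015
MONTHLY_COST_USD = 120_000
HALLUCINATION_RATE = 0.02


def p95(values):
    """Nearest-rank 95th percentile (an assumed method; others exist)."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def max_monthly_interactions():
    """Volume at which the per-interaction ceiling exhausts the monthly ceiling."""
    return round(MONTHLY_COST_USD / COST_PER_INTERACTION_USD)


def check_budgets(latencies_ms, costs_usd, hallucinated_flags, monthly_interactions):
    """Compare sampled measurements against each ceiling.

    Assumption: the per-interaction cost ceiling is applied to the mean cost.
    """
    mean_cost = sum(costs_usd) / len(costs_usd)
    hallucination_rate = sum(hallucinated_flags) / len(hallucinated_flags)
    return {
        "latency_ok": p95(latencies_ms) < P95_LATENCY_MS,
        "unit_cost_ok": mean_cost < COST_PER_INTERACTION_USD,
        "monthly_cost_ok": mean_cost * monthly_interactions < MONTHLY_COST_USD,
        "hallucination_ok": hallucination_rate < HALLUCINATION_RATE,
    }
```

A gate like this only covers the measurable ceilings; the safety and UX constraints need separate adversarial and qualitative checks.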
Available Resources
- Historical anonymized Asana interactions: accepted/rejected AI suggestions, user edits after acceptance, follow-up task changes, re-opened tasks, and manual rewrites
- Workspace-scoped context: task title, description, comments, custom fields, project brief, recent status updates, and user permissions
- Existing LLM providers approved by Asana, plus a smaller low-cost model for routing or judging
- Human reviewers from product ops who can label a golden set for helpfulness, correctness, and safety
Task
- Define an evaluation framework that determines whether Asana AI is helping users beyond novelty, including primary success metrics, guardrails, and segment-level analysis.
- Design the offline evaluation plan first: golden set creation, labeling rubric, LLM-as-judge calibration, and adversarial tests for hallucination and prompt injection.
- Propose the online evaluation / experiment design to measure durable user value, not just short-term engagement with AI suggestions.
- Specify the prompting and serving approach for generating grounded writing assistance from Asana context, including refusal behavior when context is incomplete.
- Estimate cost and latency, and explain what you would change if the feature misses either budget or quality targets.
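As an illustration of what the LLM-as-judge calibration step in the offline plan might involve, the sketch below computes Cohen's kappa between human reviewer labels and judge labels on the same golden-set items. The brief does not prescribe this statistic; the function name, the label values, and the convention of returning 1.0 when both raters use a single identical label are assumptions.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two raters on the same items."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human_labels)

    # Fraction of items where the judge matches the human reviewer.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Agreement expected by chance from each rater's label frequencies.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label everywhere (assumed convention).
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four golden-set responses.
human = ["supported", "supported", "unsupported", "unsupported"]
judge = ["supported", "supported", "unsupported", "supported"]
kappa = cohens_kappa(human, judge)  # 0.5 for this toy example
```

A low kappa on the golden set would suggest revising the judge prompt or rubric before trusting judge scores at scale; the threshold for "good enough" is a choice for the team to make.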