Context
StreamForge has a prototype GenAI copilot that helps creators apply real-time media effects during live video sessions. Users type requests like "make the background cyberpunk, keep skin tones natural, and add subtle bass-reactive lighting," and the system must translate them into safe, executable effect graphs and parameter updates without disrupting the live stream.
Constraints
- End-to-end p95 latency: at most 350 ms for effect updates during a live session
- Cost ceiling: $0.015 per session-minute at 200K monthly live sessions
- Execution accuracy: at least 92% task success on a labeled test set of creator requests
- Hallucination ceiling: <2% invalid or unsupported effect/tool calls
- Safety: no unsafe visual outputs, no leaking hidden system instructions, and resilience to prompt injection from user text overlays, scene metadata, or retrieved effect docs
- Fallback behavior: if confidence is low, ask one short clarification or return a safe no-op recommendation
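The numeric constraints and the fallback rule above can be made machine-checkable. The sketch below is illustrative only: the confidence thresholds (0.8 and 0.5), the `avg_session_minutes` parameter, and all function and class names are assumptions, not part of the brief.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    """Numeric targets copied from the constraint list."""
    p95_latency_ms: float = 350.0
    cost_per_session_minute_usd: float = 0.015
    monthly_sessions: int = 200_000
    min_task_success: float = 0.92
    max_invalid_call_rate: float = 0.02


def monthly_budget_usd(c: Constraints, avg_session_minutes: float) -> float:
    """Total monthly spend ceiling. Session length is NOT given in the brief,
    so the caller must supply an assumed average."""
    return c.cost_per_session_minute_usd * avg_session_minutes * c.monthly_sessions


def fallback_action(confidence: float, clarifications_asked: int,
                    apply_threshold: float = 0.8,
                    clarify_threshold: float = 0.5) -> str:
    """Pick the next step for a request. Thresholds are hypothetical and
    would need calibration against the labeled test set."""
    if confidence >= apply_threshold:
        return "apply"
    # The brief allows at most one short clarification per request.
    if confidence >= clarify_threshold and clarifications_asked == 0:
        return "clarify"
    return "no_op"
```

For example, under an assumed 30-minute average session, `monthly_budget_usd(Constraints(), 30)` gives a ceiling of roughly $90,000 per month, which is the figure any per-request model and GPU cost estimate would need to fit within.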
Available Resources
- A catalog of 1,200 supported effects, each with parameter schemas, latency cost, GPU cost, and compatibility constraints
- Historical logs from the prototype: 3M user prompts, chosen effects, manual corrections, and session outcomes
- Real-time tools: list_effects, validate_graph, estimate_render_cost, apply_effect_patch, and rollback_patch
- Models available: a fast small LLM for routing/classification and a stronger model for complex composition
- Optional retrieval index over effect documentation, examples, and policy rules
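Because every supported effect comes with a parameter schema and compatibility constraints, a model-proposed call can be checked against the catalog before anything touches the live stream. The following is a minimal sketch of such a check; the `EffectSpec` and `ParamSpec` shapes, the numeric-range-only schema, and the `validate_call` helper are hypothetical, and this is not the behavior of the actual `validate_graph` tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSpec:
    minimum: float
    maximum: float


@dataclass(frozen=True)
class EffectSpec:
    """Assumed shape of one catalog entry (numeric parameters only)."""
    name: str
    params: dict[str, ParamSpec]
    latency_ms: float
    incompatible_with: frozenset[str] = frozenset()


def validate_call(catalog: dict[str, EffectSpec], active: set[str],
                  name: str, args: dict[str, object]) -> list[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    spec = catalog.get(name)
    if spec is None:
        # A hallucinated effect name: reject before checking anything else.
        return [f"unknown effect: {name}"]
    errors: list[str] = []
    for key, value in args.items():
        param = spec.params.get(key)
        if param is None:
            errors.append(f"unknown parameter: {name}.{key}")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"non-numeric value: {name}.{key}")
        elif not (param.minimum <= value <= param.maximum):
            errors.append(f"out of range: {name}.{key}={value}")
    for other in sorted(spec.incompatible_with & active):
        errors.append(f"incompatible with active effect: {other}")
    return errors
```

Any non-empty result would count toward the invalid or unsupported call rate and could trigger the clarification or no-op fallback rather than a patch.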
Task
- Design how you would productize the prototype into a production LLM system for real-time effect generation, including prompting, tool use, and fallback behavior.
- Define an eval-first plan: offline evaluation before launch and online monitoring after launch. Be explicit about hallucination, prompt injection, and invalid tool-call rates.
- Propose the serving architecture and retrieval strategy, if any, that can meet the latency and cost constraints.
- Explain whether you would rely on prompt engineering, fine-tuning, an agent loop, or a hybrid approach, and why.
- Estimate cost/latency and list the top failure modes, mitigations, and launch guardrails.
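As one way to make the eval-first plan concrete, the sketch below scores a batch of labeled eval records against the stated targets (92% task success, under 2% invalid tool calls, p95 at or below 350 ms). The `EvalRecord` fields, the nearest-rank percentile method, and the function names are assumptions. The brief sets no numeric target for prompt injection, so that rate is reported but not gated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    """Assumed per-request eval record; field names are illustrative."""
    task_success: bool
    tool_calls: int
    invalid_tool_calls: int
    injection_attempt: bool
    injection_succeeded: bool
    latency_ms: float


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def summarize(records: list[EvalRecord]) -> dict[str, float]:
    if not records:
        raise ValueError("no eval records")
    calls = sum(r.tool_calls for r in records)
    attempts = sum(r.injection_attempt for r in records)
    return {
        "task_success": sum(r.task_success for r in records) / len(records),
        "invalid_call_rate": (sum(r.invalid_tool_calls for r in records) / calls
                              if calls else 0.0),
        "injection_success_rate": (sum(r.injection_succeeded for r in records) / attempts
                                   if attempts else 0.0),
        "p95_latency_ms": p95([r.latency_ms for r in records]),
    }


def launch_gate(summary: dict[str, float]) -> list[str]:
    """Return the names of the targets that were missed (empty means pass)."""
    failed = []
    if summary["task_success"] < 0.92:
        failed.append("task_success")
    if summary["invalid_call_rate"] >= 0.02:
        failed.append("invalid_call_rate")
    if summary["p95_latency_ms"] > 350.0:
        failed.append("p95_latency_ms")
    return failed
```

The same summary could be computed over sampled production traffic after launch, so that offline gates and online monitors share one definition of each rate.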