Context
FinPilot is an LLM-powered support copilot for a fintech operations team. The team frequently updates prompts to improve answer quality, but recent changes have caused regressions in refusal behavior and cache hit rate, along with an increase in unsupported claims.
Constraints
- p95 end-to-end latency: ≤1,200 ms
- Cost ceiling: $12K/month at 400K requests/month (an average of $0.03 per request)
- Hallucination rate: <2% on a 300-example golden set
- Prompt injection success rate: <0.5% on the adversarial test set
- Rollback to last known-good prompt version must complete in <5 minutes
- Cached responses must not leak tenant data or stale policy content
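The thresholds above translate into concrete per-request and per-test-set budgets. The following is a minimal sketch of that arithmetic, using only the figures stated in this brief; the function and variable names are illustrative and not part of any existing FinPilot code.

```python
import math
from fractions import Fraction

# Figures taken from the Constraints and Available Resources lists.
MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 400_000
GOLDEN_SET_SIZE = 300
ADVERSARIAL_SET_SIZE = 50

# Both rate limits are strict upper bounds ("<"), kept as exact fractions
# to avoid floating-point rounding at the boundary.
MAX_HALLUCINATION_RATE = Fraction(2, 100)   # <2%
MAX_INJECTION_RATE = Fraction(5, 1000)      # <0.5%


def max_failures_below(rate: Fraction, n: int) -> int:
    """Largest failure count k such that k / n is strictly below rate."""
    return math.ceil(rate * n) - 1


per_request_budget_usd = MONTHLY_BUDGET_USD / MONTHLY_REQUESTS
max_hallucinations = max_failures_below(MAX_HALLUCINATION_RATE, GOLDEN_SET_SIZE)
max_injections = max_failures_below(MAX_INJECTION_RATE, ADVERSARIAL_SET_SIZE)

print(f"Average budget per request: ${per_request_budget_usd:.3f}")
print(f"Allowed hallucinations on golden set: {max_hallucinations}")
print(f"Allowed successful injections on adversarial set: {max_injections}")
```

Under these figures the average budget is $0.03 per request, at most 5 of the 300 golden-set examples may contain a hallucination, and no successful injection is allowed on the 50-example adversarial set. A single success there already measures 2%, so a set of that size cannot resolve a 0.5% rate; a design may want to note this limitation when defining the safety gate.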
Available Resources
- Historical prompts, prompt templates, and release notes for the last 20 versions
- Request/response logs with metadata: tenant_id, task_type, latency, token counts, user feedback, escalation flag
- A 300-example labeled golden set and a 50-example adversarial prompt-injection set
- Approved models: GPT-4.1-mini for primary serving, GPT-4.1 for offline judging
- Redis for response caching and Postgres for the prompt registry and version metadata
Task
- Design a prompt versioning strategy that supports reproducibility, staged rollout, auditability, and fast rollback. Be explicit about what constitutes a version: system prompt, few-shot examples, model parameters, output schema, and safety rules.
- Define a caching strategy for prompt-based LLM calls, including cache keys, invalidation rules, TTLs, and when caching should be disabled. Address tenant isolation, prompt-version awareness, and stale-answer risk.
- Specify an evaluation plan before rollout: offline tests for quality and safety, plus online monitoring and canary metrics to decide promotion or rollback.
- Propose a rollback mechanism for bad prompt releases, including triggers, blast-radius control, and how you would preserve debuggability after rollback.
- Estimate the cost and latency impact of versioned prompts, cache hits, canary traffic, and rollback safeguards.
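One possible starting point for the versioning and caching tasks is to treat a version as an immutable bundle of the five listed components plus the model name, identify it by a content hash, and build that hash into every cache key. The sketch below illustrates the idea under stated assumptions: `PromptVersion`, `cache_key`, the `finpilot:` key prefix, and the `policy_revision` parameter are hypothetical names introduced here, and the key layout assumes `tenant_id` and `policy_revision` contain no `:` characters.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical immutable bundle of everything that affects model output."""
    system_prompt: str
    few_shot_examples: tuple   # tuple of (user_text, assistant_text) pairs
    model: str
    model_params: tuple        # sorted (name, value) pairs, e.g. temperature
    output_schema: str
    safety_rules: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so that identical
        # content always yields the same SHA-256 digest.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_key(tenant_id: str, version: PromptVersion,
              policy_revision: str, user_input: str) -> str:
    """Build a tenant-scoped, version-aware key (layout is an assumption)."""
    # Collapse whitespace only; more aggressive normalization could merge
    # requests that deserve different answers.
    normalized = " ".join(user_input.split())
    input_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return (f"finpilot:{tenant_id}:{version.fingerprint()[:16]}:"
            f"{policy_revision}:{input_hash}")


# Usage: two versions that differ only in one safety rule.
v1 = PromptVersion(
    system_prompt="You are FinPilot, a support copilot.",
    few_shot_examples=(("How do I reset a card PIN?", "Open the card settings."),),
    model="gpt-4.1-mini",
    model_params=(("max_tokens", 512), ("temperature", 0.0)),
    output_schema='{"answer": "string", "escalate": "boolean"}',
    safety_rules="Refuse requests for other tenants' data.",
)
v2 = PromptVersion(
    system_prompt=v1.system_prompt,
    few_shot_examples=v1.few_shot_examples,
    model=v1.model,
    model_params=v1.model_params,
    output_schema=v1.output_schema,
    safety_rules="Refuse requests for other tenants' data. Never quote fees.",
)

key_v1 = cache_key("tenant-a", v1, "policy-7", "What is the  wire limit?")
key_v2 = cache_key("tenant-a", v2, "policy-7", "What is the wire limit?")
key_other_tenant = cache_key("tenant-b", v1, "policy-7", "What is the wire limit?")
```

Because the tenant, version fingerprint, and policy revision are all part of the key, a rollback or policy update simply stops reading the old entries instead of requiring an explicit purge, and one tenant's cached answer cannot be served to another. Whether that trade-off (lower hit rate right after each release) fits the cost ceiling is one of the things the estimate in the last task item would need to cover.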