Version Prompts with Safe Rollback

Context

FinPilot is an LLM-powered support copilot for a fintech operations team. The team frequently updates prompts to improve answer quality, but recent changes have caused regressions in refusal behavior, cache hit rate, and unsupported claims.

Constraints

p95 end-to-end latency: ≤1,200 ms
Cost ceiling: $12K/month at 400K requests/month
Hallucination rate: <2% on a 300-example golden set
Prompt injection success rate: <0.5% on adversarial tests
Rollback to last known-good prompt version must complete in <5 minutes
Cached responses must not leak tenant data or stale policy content
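
As a quick sanity check, the numeric constraints above can be converted into a per-request budget and into the number of failures each test set can tolerate. This is a minimal sketch, not part of the problem statement. The function names are illustrative. It assumes the entire monthly budget goes to serving and that each test set is run once, with the injection threshold applied to the 50-example adversarial set listed under Available Resources.

```python
import math
from fractions import Fraction

# Figures taken from the Constraints list.
MONTHLY_BUDGET_USD = 12_000
MONTHLY_REQUESTS = 400_000


def per_request_budget(budget_usd: float, requests: int) -> float:
    """Average spend allowed per request (assumes all budget goes to serving)."""
    return budget_usd / requests


def max_failures_below(rate_limit: Fraction, n_examples: int) -> int:
    """Largest failure count k such that k / n_examples is strictly below rate_limit."""
    # k < rate_limit * n  <=>  k <= ceil(rate_limit * n) - 1
    return math.ceil(rate_limit * n_examples) - 1


budget = per_request_budget(MONTHLY_BUDGET_USD, MONTHLY_REQUESTS)
# Hallucination rate must stay below 2% on the 300-example golden set.
hallucination_cap = max_failures_below(Fraction(2, 100), 300)
# Injection success rate must stay below 0.5%; assumed to be measured on the 50-example set.
injection_cap = max_failures_below(Fraction(5, 1000), 50)
```

Under these assumptions the budget works out to $0.03 per request, the golden set tolerates at most 5 hallucinated answers, and the adversarial set tolerates 0 successful injections in a single pass, since one success out of 50 is already 2%.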

Available Resources

Historical prompts, prompt templates, and release notes for the last 20 versions
Request/response logs with metadata: tenant_id, task_type, latency, token counts, user feedback, escalation flag
A 300-example labeled golden set and a 50-example adversarial prompt-injection set
Approved models: GPT-4.1-mini for primary serving, GPT-4.1 for offline judging
Redis for response caching and Postgres for prompt registry / version metadata

Task

Design a prompt versioning strategy that supports reproducibility, staged rollout, auditability, and fast rollback. Be explicit about what constitutes a version: system prompt, few-shot examples, model parameters, output schema, and safety rules.
Define a caching strategy for prompt-based LLM calls, including cache keys, invalidation rules, TTLs, and when caching should be disabled. Address tenant isolation, prompt-version awareness, and stale-answer risk.
Specify an evaluation plan before rollout: offline tests for quality and safety, plus online monitoring and canary metrics to decide promotion or rollback.
Propose a rollback mechanism for bad prompt releases, including triggers, blast-radius control, and how you would preserve debuggability after rollback.
Estimate the cost and latency impact of versioned prompts, cache hits, canary traffic, and rollback safeguards.
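
One possible starting point for the versioning and caching tasks above is to treat a version as the full bundle of components the task names, derive its ID from a hash of that bundle, and put both the tenant and the version ID into the cache key. The sketch below is illustrative only. `PromptVersion`, `version_id`, `cache_key`, and the `policy_revision` field are hypothetical names that do not come from the problem statement, and whitespace normalization of the input is an assumption.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptVersion:
    """Everything that defines one version (hypothetical structure)."""
    system_prompt: str
    few_shot_examples: tuple  # (user_text, assistant_text) pairs
    model: str
    model_params: dict        # e.g. temperature, max_tokens
    output_schema: str
    safety_rules: str


def version_id(version: PromptVersion) -> str:
    """Content-addressed ID: identical content always yields the same ID."""
    canonical = json.dumps(asdict(version), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def cache_key(tenant_id: str, task_type: str, prompt_version_id: str,
              policy_revision: str, user_input: str) -> str:
    """Cache key scoped by tenant, prompt version, and policy revision.

    policy_revision is an assumed identifier for the policy content in force;
    bumping it makes answers cached under the old policy unreachable.
    """
    normalized = " ".join(user_input.split())  # collapse whitespace only
    # Hash a JSON list so that a delimiter inside any field cannot cause collisions.
    payload = json.dumps(
        [tenant_id, task_type, prompt_version_id, policy_revision, normalized]
    )
    return "finpilot:resp:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


v1 = PromptVersion(
    system_prompt="You are a support copilot.",
    few_shot_examples=(("How do I reset a limit?", "Open Settings, then Limits."),),
    model="gpt-4.1-mini",
    model_params={"temperature": 0.2, "max_tokens": 400},
    output_schema='{"answer": "string"}',
    safety_rules="Refuse requests for another tenant's data.",
)
key = cache_key("tenant-a", "faq", version_id(v1), "policy-7", "How do I  reset a limit?")
```

With keys built this way, a rollback only needs to change which version ID is marked active in the registry: entries written by the bad version stop matching and can expire through their TTL, and no key is ever shared across tenants. Whether hashing the raw user input is appropriate for every task type is a design choice the answer should justify.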
