Design Fast LLM Fallback Responses

Context

FinMate, a personal finance app, uses an LLM assistant to answer user questions about transactions, budgets, and help-center policies. The current system depends on a single external model provider, and response quality drops sharply when that provider is slow or unavailable.

Constraints

p95 end-to-end latency: < 1,500ms during normal operation, < 2,000ms during provider degradation
Availability SLO: 99.9% successful responses
Cost ceiling: $12K/month at 1.2M requests/month
Hallucination rate on a 400-question golden set: < 2% for policy/account answers
For unsupported or degraded scenarios, the system should prefer safe partial answers or explicit fallback/refusal over guessing
Must resist prompt injection from retrieved help-center content or user input
No sensitive account data may be logged in raw form

Available Resources

25K help-center and policy documents, updated daily
Structured product metadata and a small library of deterministic backend APIs (account status, transaction lookup, card limits)
Access to two LLM providers: one high-quality but slower model, and one cheaper/faster model
Redis or equivalent cache, feature flags, and standard observability tooling
Historical logs with latency, provider errors, and user feedback

Task

Design a response architecture that keeps answers fast when the primary provider is slow, rate-limited, or unavailable. Include caching, model fallback, and any degraded-mode behavior.
Specify the prompting strategy for grounded answers, safe refusals, and handling prompt injection in retrieved content.
Define an evaluation-first plan: offline tests for latency/quality/safety and online metrics or experiments to validate the design.
Estimate cost and latency for normal mode and degraded mode, and explain how routing decisions balance both.
Identify the main failure modes, including hallucinations during fallback and stale/unsafe cached responses, and propose mitigations.

Constraints

p95 end-to-end latency: < 1,500ms during normal operation, < 2,000ms during provider degradation

Availability SLO: 99.9% successful responses

Cost ceiling: $12K/month at 1.2M requests/month

Hallucination rate on a 400-question golden set: < 2% for policy/account answers

For unsupported or degraded scenarios, the system should prefer safe partial answers or explicit fallback/refusal over guessing

Must resist prompt injection from retrieved help-center content or user input

No sensitive account data may be logged in raw form

Available Resources

25K help-center and policy documents, updated daily

Structured product metadata and a small library of deterministic backend APIs (account status, transaction lookup, card limits)

Access to two LLM providers: one high-quality but slower model, and one cheaper/faster model

Redis or equivalent cache, feature flags, and standard observability tooling

Historical logs with latency, provider errors, and user feedback

Task

Design a response architecture that keeps answers fast when the primary provider is slow, rate-limited, or unavailable. Include caching, model fallback, and any degraded-mode behavior.

Specify the prompting strategy for grounded answers, safe refusals, and handling prompt injection in retrieved content.

Define an evaluation-first plan: offline tests for latency/quality/safety and online metrics or experiments to validate the design.

Estimate cost and latency for normal mode and degraded mode, and explain how routing decisions balance both.

Identify the main failure modes, including hallucinations during fallback and stale/unsafe cached responses, and propose mitigations.

Constraints

p95 end-to-end latency: < 1,500ms during normal operation, < 2,000ms during provider degradation

Availability SLO: 99.9% successful responses

Cost ceiling: $12K/month at 1.2M requests/month

Hallucination rate on a 400-question golden set: < 2% for policy/account answers

For unsupported or degraded scenarios, the system should prefer safe partial answers or explicit fallback/refusal over guessing

Must resist prompt injection from retrieved help-center content or user input

No sensitive account data may be logged in raw form

Available Resources

25K help-center and policy documents, updated daily

Structured product metadata and a small library of deterministic backend APIs (account status, transaction lookup, card limits)

Access to two LLM providers: one high-quality but slower model, and one cheaper/faster model

Redis or equivalent cache, feature flags, and standard observability tooling

Historical logs with latency, provider errors, and user feedback

Task

Design a response architecture that keeps answers fast when the primary provider is slow, rate-limited, or unavailable. Include caching, model fallback, and any degraded-mode behavior.

Specify the prompting strategy for grounded answers, safe refusals, and handling prompt injection in retrieved content.

Define an evaluation-first plan: offline tests for latency/quality/safety and online metrics or experiments to validate the design.

Estimate cost and latency for normal mode and degraded mode, and explain how routing decisions balance both.

Identify the main failure modes, including hallucinations during fallback and stale/unsafe cached responses, and propose mitigations.

Constraints

p95 end-to-end latency: < 1,500ms during normal operation, < 2,000ms during provider degradation

Availability SLO: 99.9% successful responses

Cost ceiling: $12K/month at 1.2M requests/month

Hallucination rate on a 400-question golden set: < 2% for policy/account answers

For unsupported or degraded scenarios, the system should prefer safe partial answers or explicit fallback/refusal over guessing

Must resist prompt injection from retrieved help-center content or user input

No sensitive account data may be logged in raw form

Available Resources

25K help-center and policy documents, updated daily

Structured product metadata and a small library of deterministic backend APIs (account status, transaction lookup, card limits)

Access to two LLM providers: one high-quality but slower model, and one cheaper/faster model

Redis or equivalent cache, feature flags, and standard observability tooling

Historical logs with latency, provider errors, and user feedback

Task

Design a response architecture that keeps answers fast when the primary provider is slow, rate-limited, or unavailable. Include caching, model fallback, and any degraded-mode behavior.

Specify the prompting strategy for grounded answers, safe refusals, and handling prompt injection in retrieved content.

Define an evaluation-first plan: offline tests for latency/quality/safety and online metrics or experiments to validate the design.

Estimate cost and latency for normal mode and degraded mode, and explain how routing decisions balance both.

Identify the main failure modes, including hallucinations during fallback and stale/unsafe cached responses, and propose mitigations.

Interview Guides

Context

Constraints

Available Resources

Task

Design Fast LLM Fallback Responses

Context

Constraints

Available Resources

Task

Your Answer

Design Fast LLM Fallback Responses

Context

Constraints

Available Resources

Task

Design Fast LLM Fallback Responses

Context

Constraints

Available Resources

Task

Your Answer