## Product Context
AssistFlow is a client-facing AI support platform used by enterprise customers. End users type a question in a web or mobile chat, and the system must retrieve relevant knowledge, rank candidate answers, and return a high-quality response with low latency.
## Scale
| Signal | Value |
|---|---|
| DAU | 12M end users |
| Peak QPS | 45K chat turns/sec |
| Enterprise tenants | 18K |
| Knowledge corpus | 220M documents/snippets across tenants |
| New/updated documents | 9M per day |
| p99 latency budget | 900ms end-to-end |
| Context window for final response | up to 16K tokens |
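As a quick sanity check on these numbers, the peak QPS and per-request cost cap imply a hard ceiling on inference spend, and the 900ms p99 budget has to be split across stages. The stage split below is purely illustrative (not given in the spec); only `PEAK_QPS`, `COST_PER_REQ`, and `P99_BUDGET_MS` come from the table above.

```python
# Back-of-envelope check on the scale/cost figures from the table.
PEAK_QPS = 45_000          # chat turns/sec at peak
COST_PER_REQ = 0.012       # max average inference cost per request ($)
P99_BUDGET_MS = 900        # end-to-end latency budget

# Upper bound on inference spend if every peak-second request hits the cap.
peak_spend_per_sec = PEAK_QPS * COST_PER_REQ  # dollars per second at peak

# One *assumed* way to split the p99 budget across stages; any real design
# would tune these from measured stage latencies.
stage_budget_ms = {
    "retrieval": 150,
    "ranking": 150,
    "synthesis": 500,
    "network_and_overhead": 100,
}
assert sum(stage_budget_ms.values()) == P99_BUDGET_MS
```

At $540/s of peak spend, the cost constraint pushes the design toward cheap early stages (vector recall, lightweight rankers) with the GPU-heavy generative stage reserved for the final few candidates.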
## Task
Design an end-to-end ML system for this application, from ingestion through production monitoring. Your design should address:
- How data is ingested, cleaned, versioned, and made available for training and serving
- A multi-stage online architecture for retrieval, ranking, and optional final re-ranking or answer synthesis
- Model choices for each stage, including how you handle tenant isolation, freshness, and cold start
- Training, offline evaluation, and online experimentation strategy
- Monitoring, alerting, rollback, and failure handling in production
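The multi-stage architecture asked for above can be sketched minimally as a cheap recall stage followed by a heavier rerank stage, with tenant isolation enforced as a hard filter at retrieval time. All names here (`Doc`, `retrieve`, `rank`) are hypothetical, and cosine similarity stands in for both the ANN index lookup and the cross-encoder scorer a real system would use.

```python
import math
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    tenant_id: str
    text: str
    embedding: list  # precomputed dense vector

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, tenant_id, k=50):
    # Stage 1: cheap vector recall. The tenant filter is a hard predicate,
    # not a ranking feature -- cross-tenant leakage must be impossible.
    candidates = [d for d in index if d.tenant_id == tenant_id]
    candidates.sort(key=lambda d: cosine(query_vec, d.embedding), reverse=True)
    return candidates[:k]

def rank(query_vec, candidates, k=5):
    # Stage 2: stand-in for a heavier cross-encoder that rescores the
    # small candidate set before optional answer synthesis.
    scored = [(cosine(query_vec, d.embedding), d) for d in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]
```

A production system would replace the linear scan with a per-tenant (or tenant-partitioned) ANN index and the rerank scorer with a learned model, but the shape — recall wide, rank narrow, synthesize last — is the same.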
## Constraints
- Tenant data must remain logically isolated; cross-tenant leakage is a hard compliance failure
- 35% of documents are updated within 24 hours, so freshness matters
- GPU budget is limited: average inference cost must stay below $0.012 per request
- Some customers require explainability: the system must expose citations and confidence signals
- The product must degrade gracefully if the generative stage is unavailable; retrieval-only answers are acceptable
- Training labels are partially delayed and noisy because many users do not explicitly rate responses
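The graceful-degradation constraint can be made concrete with a small fallback sketch: if the generative stage fails or times out, the response is assembled from the top retrieved snippets, and citations are attached in both paths so the explainability requirement survives the outage. `synthesize` and the response shape are hypothetical stand-ins, not a prescribed API.

```python
def synthesize(query, docs):
    # Stand-in for the generative stage; simulated here as unavailable
    # to exercise the fallback path.
    raise TimeoutError("generative backend unavailable")

def answer(query, retrieved_docs):
    # Citations are built from retrieval results, so they are available
    # even when synthesis fails -- explainability does not depend on the GPU tier.
    citations = [d["doc_id"] for d in retrieved_docs]
    try:
        text = synthesize(query, retrieved_docs)
        return {"mode": "generative", "text": text, "citations": citations}
    except (TimeoutError, ConnectionError):
        # Degrade gracefully: return the top retrieved snippets verbatim.
        snippets = [d["text"] for d in retrieved_docs[:3]]
        return {"mode": "retrieval_only",
                "text": "\n".join(snippets),
                "citations": citations}
```

The same pattern extends to confidence signals: a retrieval-only response can still carry the ranker's scores, giving the client a usable answer plus an honest indication that synthesis was skipped.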