ShopLens runs semantic product retrieval for a marketplace with 180M item embeddings, serving 12K queries per second at peak. To preserve recall as the corpus grew, the team recently sharded ANN search from one GPU across four GPUs. Retrieval quality remains strong, but latency now exceeds the serving SLOs.
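One effect worth keeping in mind before reading the metrics: when a query fans out to N shards and must wait for all of them, its latency is the maximum of N shard latencies, which amplifies the tail. The sketch below illustrates this with synthetic per-shard latencies (the lognormal parameters are illustrative assumptions, not measurements from this system):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-shard latency distribution: lognormal, P50 around 30 ms.
# 100,000 simulated queries, each hitting 4 shards.
per_shard = rng.lognormal(mean=np.log(30.0), sigma=0.4, size=(100_000, 4))

# A fanned-out query completes only when its slowest shard responds.
fanout = per_shard.max(axis=1)

for name, x in [("single shard", per_shard[:, 0]), ("4-shard fan-out", fanout)]:
    p50, p95 = np.percentile(x, [50, 95])
    print(f"{name}: P50 = {p50:.0f} ms, P95 = {p95:.0f} ms")
```

Even with identically distributed shards, the fan-out P50 and P95 land well above the single-shard figures; any real load imbalance between shards makes the amplification worse.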
| Metric | Single GPU Baseline | Current 4-GPU System | Target |
|---|---|---|---|
| Recall@100 | 0.942 | 0.968 | >= 0.960 |
| Precision@10 | 0.611 | 0.618 | >= 0.610 |
| F1@10 | 0.744 | 0.758 | >= 0.750 |
| P50 latency | 38 ms | 71 ms | <= 45 ms |
| P95 latency | 82 ms | 184 ms | <= 100 ms |
| P99 latency | 121 ms | 267 ms | <= 150 ms |
| QPS sustained | 8,900 | 12,400 | >= 12,000 |
| GPU utilization | 76% | 43% (avg) | 70-85% |
| Host CPU utilization | 41% | 88% | < 70% |
| Inter-GPU transfer per query | 2.1 MB | 11.8 MB | < 4 MB |
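A quick back-of-the-envelope check on two rows of the table (per-query transfer and sustained QPS) shows the scale of the data-movement problem:

```python
# Aggregate inter-GPU traffic implied by the table's 4-GPU figures.
transfer_per_query_mb = 11.8   # inter-GPU transfer per query (table)
qps = 12_400                   # sustained QPS (table)

aggregate_gb_per_s = transfer_per_query_mb * qps / 1000
print(f"Current aggregate inter-GPU traffic: {aggregate_gb_per_s:.1f} GB/s")

# The same QPS at the < 4 MB/query target implies:
target_gb_per_s = 4 * qps / 1000
print(f"At target transfer: {target_gb_per_s:.1f} GB/s")
```

Roughly 146 GB/s of sustained inter-GPU traffic is a heavy interconnect load for four GPUs, and each megabyte moved per query also adds serialization, copy, and host-coordination work, which is consistent with the 88% host CPU utilization.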
The retrieval team is confident the index-scaling strategy improved recall, but the multi-GPU deployment introduced a latency bottleneck somewhere in query fan-out, candidate merge, memory movement, or shard load imbalance. Your task is to diagnose the most likely bottlenecks and recommend the first optimizations to try.
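For reference, a lean candidate merge only needs each shard's top-k `(id, score)` pairs — on the order of a few kilobytes per query — rather than full candidate embeddings, which is one common way a sharded deployment ends up moving many megabytes per query. A minimal sketch of that merge (all names and the simulated shard data are hypothetical, not from the ShopLens codebase):

```python
import heapq
import numpy as np

def merge_shard_results(shard_results, k=100):
    """Merge per-shard top-k lists into a global top-k.

    shard_results: list of (ids, scores) array pairs, one per GPU shard.
    Each shard ships only k (id, score) pairs -- roughly k * 12 bytes --
    instead of full candidate embeddings.
    """
    merged = []
    for ids, scores in shard_results:
        merged.extend(zip(scores.tolist(), ids.tolist()))
    # O(n log k) CPU-side merge over n = k * num_shards candidates.
    top = heapq.nlargest(k, merged)
    return [doc_id for _, doc_id in top]

# Simulated: 4 shards, each returning its local top-100 (id, score) pairs.
rng = np.random.default_rng(0)
shards = []
for s in range(4):
    ids = rng.choice(1_000_000, size=100, replace=False) + s * 1_000_000
    scores = rng.random(100)
    shards.append((ids, scores))

global_top = merge_shard_results(shards, k=100)
print(len(global_top))  # 100
```

If the deployed merge path instead gathers raw vectors to a coordinator (or re-ranks on the host), that would explain the combination of high per-query transfer, high host CPU, and idle GPUs in the table.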