Diagnose Slow Multi-GPU Vector Search

Hard

Model Evaluation

Asked at 1 company

Precision, Recall, Threshold Tuning

Problem

Context

ShopLens runs semantic product retrieval for a marketplace with 180M item embeddings and 12K queries per second at peak. The team recently moved ANN search from 1 GPU to 4 GPUs to preserve recall at a larger corpus size, but latency is now above the serving SLO even though retrieval quality remains strong.

Current Performance

Metric	Single GPU Baseline	Current 4-GPU System	Target
Recall@100	0.942	0.968	>= 0.960
Precision@10	0.611	0.618	>= 0.610
F1@10	0.744	0.758	>= 0.750
P50 latency	38 ms	71 ms	<= 45 ms
P95 latency	82 ms	184 ms	<= 100 ms
P99 latency	121 ms	267 ms	<= 150 ms
QPS sustained	8,900	12,400	>= 12,000
GPU utilization	76%	43% avg	70-85%
Host CPU utilization	41%	88%	< 70%
Inter-GPU transfer / query	2.1 MB	11.8 MB	< 4 MB
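
One way to read the table is to compute how much each system metric moved relative to the single-GPU baseline, and what the per-query transfer implies in aggregate. The sketch below uses only the numbers from the table. The aggregate figure assumes the per-query transfer holds at the sustained QPS and that MB means 10^6 bytes; neither is stated in the problem.

```python
# Values copied from the Current Performance table.
baseline = {"p50_ms": 38, "p95_ms": 82, "p99_ms": 121, "qps": 8900,
            "gpu_util": 0.76, "cpu_util": 0.41, "xfer_mb": 2.1}
current = {"p50_ms": 71, "p95_ms": 184, "p99_ms": 267, "qps": 12400,
           "gpu_util": 0.43, "cpu_util": 0.88, "xfer_mb": 11.8}

# Ratio of the 4-GPU value to the single-GPU value for each metric.
ratios = {name: current[name] / baseline[name] for name in baseline}


def aggregate_gb_per_s(xfer_mb: float, qps: float) -> float:
    """Per-query transfer times sustained QPS, converted from MB/s to GB/s.

    Assumption: the per-query figure applies uniformly at sustained load.
    """
    return xfer_mb * qps / 1000.0


if __name__ == "__main__":
    for name, ratio in ratios.items():
        print(f"{name:9s} x{ratio:.2f}")
    print(f"aggregate transfer, baseline: {aggregate_gb_per_s(2.1, 8900):.1f} GB/s")
    print(f"aggregate transfer, 4-GPU:    {aggregate_gb_per_s(11.8, 12400):.1f} GB/s")
```

Per-query transfer grows by roughly 5.6x and host CPU utilization roughly doubles while GPU utilization falls, which is the pattern the question asks you to interpret.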

The Problem

The retrieval team believes the index scaling strategy improved recall, but the multi-GPU deployment introduced a latency bottleneck somewhere in query fan-out, candidate merge, memory movement, or load imbalance. You need to diagnose the most likely bottlenecks and recommend the first optimizations to try.
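
To make "query fan-out" and "candidate merge" concrete, here is a minimal sketch of the merge step in a sharded ANN search: each GPU shard returns its own top-k list, and something must combine them into a global top-k. The problem does not say how ShopLens shards the index or whether the merge runs on the host or on a GPU, so the function name, the `(score, item_id)` layout, and the host-side merge are all assumptions for illustration.

```python
import heapq
import itertools
from typing import List, Tuple

# Hypothetical candidate layout: (similarity score, item id); higher is better.
Candidate = Tuple[float, int]


def merge_shard_results(per_shard: List[List[Candidate]], k: int) -> List[Candidate]:
    """Merge per-shard candidate lists into a global top-k.

    Each input list must already be sorted by descending score, as a
    shard-local top-k search would return it.
    """
    merged = heapq.merge(*per_shard, key=lambda c: c[0], reverse=True)
    return list(itertools.islice(merged, k))


if __name__ == "__main__":
    # Four shards (one per GPU); the last one returned nothing for this query.
    shards = [
        [(0.95, 11), (0.80, 12)],
        [(0.91, 21), (0.60, 22)],
        [(0.99, 31)],
        [],
    ]
    print(merge_shard_results(shards, k=3))
```

If every shard returns a full top-k list, the merge handles up to 4k candidates per query, and those lists have to be moved to wherever the merge runs. That is one place where the extra transfer and host CPU load in the table could originate.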

Requirements

Interpret what the metric pattern suggests about system bottlenecks.
Identify the most likely latency contributors in a multi-GPU ANN stack.
Prioritize the first 4-5 optimizations you would test.
Explain what tradeoffs you would monitor so recall does not regress.
Propose a validation plan to confirm the true root cause.

Constraints

Recall@100 cannot fall below 0.960.
No additional GPUs this quarter.
End-to-end P95 latency must be at or below 100 ms within 3 weeks.
Rebuilding the full index takes 36 hours, so experiments must be staged carefully.
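
The constraints can be turned into a simple experiment budget and an acceptance gate. The sketch below is illustrative only: `TrialResult`, `passes_gate`, and `max_sequential_rebuilds` are hypothetical names, and the rebuild count assumes rebuilds run back to back with no time reserved for evaluation.

```python
from dataclasses import dataclass

# Limits taken from the Constraints list.
RECALL_FLOOR = 0.960
P95_TARGET_MS = 100.0
REBUILD_HOURS = 36
DEADLINE_WEEKS = 3


def max_sequential_rebuilds(deadline_weeks: int = DEADLINE_WEEKS,
                            rebuild_hours: int = REBUILD_HOURS) -> int:
    """Upper bound on full index rebuilds that fit before the deadline."""
    return (deadline_weeks * 7 * 24) // rebuild_hours


@dataclass
class TrialResult:
    name: str
    recall_at_100: float
    p95_ms: float
    needs_rebuild: bool  # True if the change requires a full index rebuild


def passes_gate(trial: TrialResult) -> bool:
    """A trial is acceptable only if it meets both the recall and latency limits."""
    return trial.recall_at_100 >= RECALL_FLOOR and trial.p95_ms <= P95_TARGET_MS


if __name__ == "__main__":
    print("max sequential rebuilds:", max_sequential_rebuilds())
    # The only measured configuration in the problem statement.
    current = TrialResult("current 4-GPU", recall_at_100=0.968, p95_ms=184.0,
                          needs_rebuild=False)
    print(current.name, "passes gate:", passes_gate(current))
```

At most 14 rebuilds fit in 3 weeks even under this optimistic assumption, which is why changes that do not need a rebuild (merge placement, batching, per-shard k) are natural candidates to test first.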
