Diagnose Slow Multi-GPU Vector Search

Hard
Model Evaluation
Asked at 1 company
Tags: Precision, Recall, Threshold Tuning
Also asked at
NVIDIA

Problem

Context

ShopLens runs semantic product retrieval for a marketplace with 180M item embeddings and 12K queries per second at peak. The team recently moved ANN search from 1 GPU to 4 GPUs to preserve recall at a larger corpus size. Retrieval quality remains strong, but latency is now above the serving SLO.

Current Performance

| Metric | Single GPU Baseline | Current 4-GPU System | Target |
|---|---|---|---|
| Recall@100 | 0.942 | 0.968 | >= 0.960 |
| Precision@10 | 0.611 | 0.618 | >= 0.610 |
| F1@10 | 0.744 | 0.758 | >= 0.750 |
| P50 latency | 38 ms | 71 ms | <= 45 ms |
| P95 latency | 82 ms | 184 ms | <= 100 ms |
| P99 latency | 121 ms | 267 ms | <= 150 ms |
| QPS sustained | 8,900 | 12,400 | >= 12,000 |
| GPU utilization | 76% | 43% avg | 70-85% |
| Host CPU utilization | 41% | 88% | < 70% |
| Inter-GPU transfer / query | 2.1 MB | 11.8 MB | < 4 MB |
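
The metrics can be read mechanically by checking each current 4-GPU value against its target. The sketch below copies those values from the table; the bound encoding and the function names are illustrative, and strict targets such as `< 70%` are treated as inclusive bounds, which does not change the outcome for these numbers. It also multiplies per-query transfer by sustained QPS to estimate aggregate inter-GPU traffic, assuming the per-query figure is a mean and 1 GB = 1000 MB.

```python
# Current 4-GPU values and targets copied from the Current Performance metrics.
# Each entry: metric -> (current, lower_bound, upper_bound); None = no bound.
# Strict targets ("< 70%") are encoded as inclusive bounds for simplicity.
METRICS = {
    "Recall@100": (0.968, 0.960, None),
    "Precision@10": (0.618, 0.610, None),
    "F1@10": (0.758, 0.750, None),
    "P50 latency (ms)": (71, None, 45),
    "P95 latency (ms)": (184, None, 100),
    "P99 latency (ms)": (267, None, 150),
    "QPS sustained": (12_400, 12_000, None),
    "GPU utilization (%)": (43, 70, 85),
    "Host CPU utilization (%)": (88, None, 70),
    "Inter-GPU transfer / query (MB)": (11.8, None, 4),
}


def failing_metrics(metrics):
    """Return the names of metrics whose current value is outside its target."""
    failed = []
    for name, (current, low, high) in metrics.items():
        below = low is not None and current < low
        above = high is not None and current > high
        if below or above:
            failed.append(name)
    return failed


def aggregate_transfer_gb_per_s(mb_per_query, qps):
    """Aggregate inter-GPU traffic, assuming mb_per_query is a per-query mean."""
    return mb_per_query * qps / 1000.0


if __name__ == "__main__":
    for name in failing_metrics(METRICS):
        print("misses target:", name)
    print("aggregate GB/s:", round(aggregate_transfer_gb_per_s(11.8, 12_400), 1))
```

All three quality metrics and sustained QPS pass; the three latency percentiles, GPU utilization, host CPU utilization, and inter-GPU transfer miss. At 12,400 QPS the per-query figure works out to roughly 146 GB/s of aggregate inter-GPU traffic, against about 48 GB/s if the 4 MB target were met at 12,000 QPS.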

The Problem

The retrieval team believes that the index scaling strategy improved recall, but that the multi-GPU deployment introduced a latency bottleneck somewhere in query fan-out, candidate merge, memory movement, or load imbalance. You need to diagnose the most likely bottlenecks and recommend the first optimizations to try.
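
To make the "candidate merge" stage concrete, here is a minimal sketch of merging per-GPU results into a global top-k. It assumes the index is sharded across the 4 GPUs and that each shard returns `(distance, item_id)` pairs sorted by ascending distance; the problem statement specifies neither, and the function names and the 8-byte candidate size (float32 distance plus int32 id) are hypothetical.

```python
import heapq
from itertools import islice


def merge_shard_candidates(shard_results, k):
    """Merge per-shard candidate lists into a global top-k.

    shard_results: one list per shard of (distance, item_id) tuples,
    each already sorted by ascending distance (an assumption of this sketch).
    """
    # heapq.merge lazily interleaves the sorted inputs, so only the first k
    # merged elements are materialized.
    return list(islice(heapq.merge(*shard_results), k))


def candidate_payload_bytes(num_shards, k, bytes_per_candidate=8):
    """Bytes needed to ship k (float32 distance, int32 id) pairs from each shard."""
    return num_shards * k * bytes_per_candidate


if __name__ == "__main__":
    # Toy data: four shards, two candidates each.
    shards = [
        [(0.10, 7), (0.40, 2)],
        [(0.05, 9), (0.30, 4)],
        [(0.20, 1), (0.25, 8)],
        [(0.15, 3), (0.50, 6)],
    ]
    print(merge_shard_candidates(shards, k=3))
    print(candidate_payload_bytes(num_shards=4, k=100))
```

Under these assumptions, four shards at k = 100 produce 3,200 bytes of candidates per query, orders of magnitude below the 11.8 MB measured, so the final id and distance lists alone are unlikely to account for the observed transfer volume.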

Requirements

  1. Interpret what the metric pattern suggests about system bottlenecks.
  2. Identify the most likely latency contributors in a multi-GPU ANN stack.
  3. Prioritize the first 4-5 optimizations you would test.
  4. Explain what tradeoffs you would monitor so recall does not regress.
  5. Propose a validation plan to confirm the true root cause.
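
For requirement 5, one possible starting point is to time each serving stage per query and compare tail percentiles by stage. The sketch below uses a nearest-rank percentile; the stage names and the timings are synthetic placeholders, not measurements from the system described here.

```python
def nearest_rank_percentile(samples, p):
    """Nearest-rank percentile for integer p in 1..100 over non-empty samples."""
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


def slowest_stage(stage_samples_ms, p=95):
    """Return (stage, percentile_ms) for the stage with the highest percentile."""
    per_stage = {
        stage: nearest_rank_percentile(samples, p)
        for stage, samples in stage_samples_ms.items()
    }
    worst = max(per_stage, key=per_stage.get)
    return worst, per_stage[worst]


if __name__ == "__main__":
    # Synthetic timings in ms, for illustration only.
    synthetic = {
        "fan_out": [2, 3, 2, 4, 3],
        "gpu_search": [20, 22, 21, 25, 23],
        "candidate_merge": [30, 45, 38, 90, 41],
    }
    print(slowest_stage(synthetic))
```

Per-stage percentiles do not add up to the end-to-end percentile, so a breakdown like this narrows the search but does not by itself confirm a root cause.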

Constraints

  • Recall@100 cannot fall below 0.960.
  • No additional GPUs this quarter.
  • End-to-end P95 latency must reach 100 ms within 3 weeks.
  • Rebuilding the full index takes 36 hours, so experiments must be staged carefully.
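
The last two constraints together bound how many full-index experiments are possible. A small worked calculation, assuming rebuilds run back to back on a single pipeline with no time reserved for evaluation or rollout (the function name is illustrative):

```python
def max_sequential_rebuilds(deadline_weeks, rebuild_hours):
    """Upper bound on back-to-back full rebuilds that fit before the deadline."""
    deadline_hours = deadline_weeks * 7 * 24
    return deadline_hours // rebuild_hours


if __name__ == "__main__":
    # 3 weeks = 504 hours; 504 / 36 = 14 rebuilds at most.
    print(max_sequential_rebuilds(deadline_weeks=3, rebuild_hours=36))
```

Fourteen is an optimistic ceiling, so experiments that do not require a rebuild can be iterated far more often within the 3-week window than those that do.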
