ShopLens runs semantic product retrieval for a marketplace with 180M item embeddings, serving 12K queries per second at peak. To preserve recall as the corpus grew, the team recently sharded ANN search from one GPU across four GPUs. Retrieval quality remains strong, but latency now exceeds the serving SLOs.
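One effect worth keeping in mind before reading the metrics: when a query fans out to N shards and must wait for all of them, its latency is the maximum of N shard latencies, which amplifies the tail. The sketch below illustrates this with synthetic per-shard latencies (the lognormal parameters are illustrative assumptions, not measurements from this system):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-shard latency distribution: lognormal, P50 around 30 ms.
# 100,000 simulated queries, each hitting 4 shards.
per_shard = rng.lognormal(mean=np.log(30.0), sigma=0.4, size=(100_000, 4))

# A fanned-out query completes only when its slowest shard responds.
fanout = per_shard.max(axis=1)

for name, x in [("single shard", per_shard[:, 0]), ("4-shard fan-out", fanout)]:
    p50, p95 = np.percentile(x, [50, 95])
    print(f"{name}: P50 = {p50:.0f} ms, P95 = {p95:.0f} ms")
```

Even with identically distributed shards, the fan-out P50 and P95 land well above the single-shard figures; any real load imbalance between shards makes the amplification worse.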
| Metric | Single GPU Baseline | Current 4-GPU System | Target |
|---|---|---|---|
| Recall@100 | 0.942 | 0.968 | >= 0.960 |
| Precision@10 | 0.611 | 0.618 | >= 0.610 |
| F1@10 | 0.744 | 0.758 | >= 0.750 |
| P50 latency | 38 ms | 71 ms | <= 45 ms |
| P95 latency | 82 ms | 184 ms | <= 100 ms |
| P99 latency | 121 ms | 267 ms | <= 150 ms |
| QPS sustained | 8,900 | 12,400 | >= 12,000 |
| GPU utilization | 76% | 43% (avg) | 70-85% |
| Host CPU utilization | 41% | 88% | < 70% |
| Inter-GPU transfer per query | 2.1 MB | 11.8 MB | < 4 MB |
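A quick back-of-the-envelope check on two rows of the table (per-query transfer and sustained QPS) shows the scale of the data-movement problem:

```python
# Aggregate inter-GPU traffic implied by the table's 4-GPU figures.
transfer_per_query_mb = 11.8   # inter-GPU transfer per query (table)
qps = 12_400                   # sustained QPS (table)

aggregate_gb_per_s = transfer_per_query_mb * qps / 1000
print(f"Current aggregate inter-GPU traffic: {aggregate_gb_per_s:.1f} GB/s")

# The same QPS at the < 4 MB/query target implies:
target_gb_per_s = 4 * qps / 1000
print(f"At target transfer: {target_gb_per_s:.1f} GB/s")
```

Roughly 146 GB/s of sustained inter-GPU traffic is a heavy interconnect load for four GPUs, and each megabyte moved per query also adds serialization, copy, and host-coordination work, which is consistent with the 88% host CPU utilization.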
The retrieval team is confident the index-scaling strategy improved recall, but the multi-GPU deployment introduced a latency bottleneck somewhere in query fan-out, candidate merge, memory movement, or shard load imbalance. Your task is to diagnose the most likely bottlenecks and recommend the first optimizations to try.
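For reference, a lean candidate merge only needs each shard's top-k `(id, score)` pairs — on the order of a few kilobytes per query — rather than full candidate embeddings, which is one common way a sharded deployment ends up moving many megabytes per query. A minimal sketch of that merge (all names and the simulated shard data are hypothetical, not from the ShopLens codebase):

```python
import heapq
import numpy as np

def merge_shard_results(shard_results, k=100):
    """Merge per-shard top-k lists into a global top-k.

    shard_results: list of (ids, scores) array pairs, one per GPU shard.
    Each shard ships only k (id, score) pairs -- roughly k * 12 bytes --
    instead of full candidate embeddings.
    """
    merged = []
    for ids, scores in shard_results:
        merged.extend(zip(scores.tolist(), ids.tolist()))
    # O(n log k) CPU-side merge over n = k * num_shards candidates.
    top = heapq.nlargest(k, merged)
    return [doc_id for _, doc_id in top]

# Simulated: 4 shards, each returning its local top-100 (id, score) pairs.
rng = np.random.default_rng(0)
shards = []
for s in range(4):
    ids = rng.choice(1_000_000, size=100, replace=False) + s * 1_000_000
    scores = rng.random(100)
    shards.append((ids, scores))

global_top = merge_shard_results(shards, k=100)
print(len(global_top))  # 100
```

If the deployed merge path instead gathers raw vectors to a coordinator (or re-ranks on the host), that would explain the combination of high per-query transfer, high host CPU, and idle GPUs in the table.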