VectorServe runs a distributed recommendation inference service across GPU nodes. The team recently scaled the serving stack from 8 to 64 GPUs, but latency complaints from product teams increased and infrastructure cost grew faster than throughput. Leadership wants a clear evaluation of throughput, latency, GPU utilization, strong scaling, weak scaling, and the gap between the observed speedup and what Amdahl's law would predict.
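As a working framework, the standard definitions below are used throughout, with the 8-GPU run as the baseline; here $X(N)$ denotes throughput at $N$ GPUs and $s$ the non-parallelizable fraction of the work:

$$
S(N) = \frac{X(N)}{X(8)}, \qquad
E_{\text{strong}}(N) = \frac{S(N)}{N/8}, \qquad
S_{\text{Amdahl}}(N) = \frac{1}{s + \frac{1-s}{N/8}}, \qquad
E_{\text{weak}}(N) = \frac{X(N)}{(N/8)\,X(8)}
$$

where $E_{\text{weak}}$ is measured with the global batch scaled in proportion to the GPU count.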
| Configuration | GPUs | Global Batch (requests) | Throughput (req/s) | P50 Latency (ms) | P95 Latency (ms) | GPU Utilization | Speedup vs 8 GPUs |
|---|---|---|---|---|---|---|---|
| Baseline | 8 | 8,000 | 12,000 | 42 | 95 | 78% | 1.00x |
| Strong scaling test | 16 | 8,000 | 20,400 | 39 | 110 | 71% | 1.70x |
| Strong scaling test | 32 | 8,000 | 31,200 | 37 | 145 | 62% | 2.60x |
| Strong scaling test | 64 | 8,000 | 40,800 | 36 | 210 | 49% | 3.40x |
| Weak scaling test | 16 | 16,000 | 23,600 | 44 | 108 | 75% | - |
| Weak scaling test | 32 | 32,000 | 45,100 | 47 | 132 | 68% | - |
| Weak scaling test | 64 | 64,000 | 83,500 | 55 | 190 | 57% | - |
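As a quick check, the scaling metrics can be recomputed directly from the table. The sketch below is minimal; the throughput figures are copied from the table and the variable names are only illustrative:

```python
# A minimal sketch (throughput values copied from the table above; names are illustrative).
BASE_GPUS = 8
BASE_TPUT = 12_000  # req/s for the 8-GPU baseline

strong = {16: 20_400, 32: 31_200, 64: 40_800}  # fixed 8,000-request global batch
weak = {16: 23_600, 32: 45_100, 64: 83_500}    # global batch grows with GPU count

for gpus, tput in strong.items():
    p = gpus / BASE_GPUS              # scale factor relative to the baseline
    speedup = tput / BASE_TPUT        # observed speedup, e.g. 3.40x at 64 GPUs
    efficiency = speedup / p          # 1.0 would be perfect strong scaling
    print(f"strong {gpus:2d} GPUs: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")

for gpus, tput in weak.items():
    p = gpus / BASE_GPUS
    efficiency = tput / (BASE_TPUT * p)  # throughput should grow in proportion to p
    print(f"weak   {gpus:2d} GPUs: {efficiency:.0%} efficiency")
```

On these numbers, strong-scaling efficiency drops from 85% at 16 GPUs to roughly 43% at 64, while weak-scaling efficiency holds up better (about 98%, 94%, and 87%).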
Adding hardware does raise throughput, but parallel efficiency falls sharply at larger cluster sizes and P95 latency worsens. You need to determine which metrics best capture this behavior and whether the bottleneck is parallel overhead, communication, or poor resource utilization.
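One way to separate a fixed serial fraction from overhead that grows with the cluster (for example, communication or synchronization) is the Karp-Flatt metric. The sketch below applies it to the observed strong-scaling speedups from the table; the symbol names follow the usual formulation and are not from the source:

```python
# A minimal sketch of the Karp-Flatt metric (speedups taken from the strong-scaling rows;
# p is the GPU count divided by the 8-GPU baseline).
observed = {2: 1.70, 4: 2.60, 8: 3.40}

for p, s in observed.items():
    # Experimentally determined serial/overhead fraction: e = (1/s - 1/p) / (1 - 1/p).
    e = (1 / s - 1 / p) / (1 - 1 / p)
    print(f"p={p} ({p * 8} GPUs): implied serial fraction {e:.2f}")

# Interpretation: a roughly constant e points at a fixed serial portion of each request
# (classic Amdahl behaviour); an e that climbs with p points at overhead that grows with
# cluster size, such as all-to-all communication or synchronization barriers.
```

Run against the table's speedups, the implied fraction sits near 0.18 at 16 and 32 GPUs and nudges up to about 0.19 at 64, a result that should be read alongside the utilization and P95 columns before assigning a single cause.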