VectorServe runs a distributed recommendation inference service across GPU nodes. The team recently scaled the serving stack from 8 to 64 GPUs, but latency complaints from product teams increased and infrastructure cost grew faster than throughput. Leadership wants a clear evaluation of throughput, latency, GPU utilization, strong scaling, weak scaling, and the gap between the observed speedup and what Amdahl's law would predict.
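As a working framework, the standard definitions below are used throughout, with the 8-GPU run as the baseline; here $X(N)$ denotes throughput at $N$ GPUs and $s$ the non-parallelizable fraction of the work:

$$
S(N) = \frac{X(N)}{X(8)}, \qquad
E_{\text{strong}}(N) = \frac{S(N)}{N/8}, \qquad
S_{\text{Amdahl}}(N) = \frac{1}{s + \frac{1-s}{N/8}}, \qquad
E_{\text{weak}}(N) = \frac{X(N)}{(N/8)\,X(8)}
$$

where $E_{\text{weak}}$ is measured with the global batch scaled in proportion to the GPU count.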
| Configuration | GPUs | Global Batch (requests) | Throughput (req/s) | P50 Latency (ms) | P95 Latency (ms) | GPU Utilization | Speedup vs 8 GPUs |
|---|---|---|---|---|---|---|---|
| Baseline | 8 | 8,000 | 12,000 | 42 | 95 | 78% | 1.00x |
| Strong scaling test | 16 | 8,000 | 20,400 | 39 | 110 | 71% | 1.70x |
| Strong scaling test | 32 | 8,000 | 31,200 | 37 | 145 | 62% | 2.60x |
| Strong scaling test | 64 | 8,000 | 40,800 | 36 | 210 | 49% | 3.40x |
| Weak scaling test | 16 | 16,000 | 23,600 | 44 | 108 | 75% | - |
| Weak scaling test | 32 | 32,000 | 45,100 | 47 | 132 | 68% | - |
| Weak scaling test | 64 | 64,000 | 83,500 | 55 | 190 | 57% | - |
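As a quick check, the scaling metrics can be recomputed directly from the table. The sketch below is minimal; the throughput figures are copied from the table and the variable names are only illustrative:

```python
# A minimal sketch (throughput values copied from the table above; names are illustrative).
BASE_GPUS = 8
BASE_TPUT = 12_000  # req/s for the 8-GPU baseline

strong = {16: 20_400, 32: 31_200, 64: 40_800}  # fixed 8,000-request global batch
weak = {16: 23_600, 32: 45_100, 64: 83_500}    # global batch grows with GPU count

for gpus, tput in strong.items():
    p = gpus / BASE_GPUS              # scale factor relative to the baseline
    speedup = tput / BASE_TPUT        # observed speedup, e.g. 3.40x at 64 GPUs
    efficiency = speedup / p          # 1.0 would be perfect strong scaling
    print(f"strong {gpus:2d} GPUs: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")

for gpus, tput in weak.items():
    p = gpus / BASE_GPUS
    efficiency = tput / (BASE_TPUT * p)  # throughput should grow in proportion to p
    print(f"weak   {gpus:2d} GPUs: {efficiency:.0%} efficiency")
```

On these numbers, strong-scaling efficiency drops from 85% at 16 GPUs to roughly 43% at 64, while weak-scaling efficiency holds up better (about 98%, 94%, and 87%).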
Adding hardware does raise throughput, but parallel efficiency falls sharply at larger cluster sizes and P95 latency worsens. You need to determine which metrics best capture this behavior and whether the bottleneck is parallel overhead, communication, or poor resource utilization.
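One way to separate a fixed serial fraction from overhead that grows with the cluster (for example, communication or synchronization) is the Karp-Flatt metric. The sketch below applies it to the observed strong-scaling speedups from the table; the symbol names follow the usual formulation and are not from the source:

```python
# A minimal sketch of the Karp-Flatt metric (speedups taken from the strong-scaling rows;
# p is the GPU count divided by the 8-GPU baseline).
observed = {2: 1.70, 4: 2.60, 8: 3.40}

for p, s in observed.items():
    # Experimentally determined serial/overhead fraction: e = (1/s - 1/p) / (1 - 1/p).
    e = (1 / s - 1 / p) / (1 - 1 / p)
    print(f"p={p} ({p * 8} GPUs): implied serial fraction {e:.2f}")

# Interpretation: a roughly constant e points at a fixed serial portion of each request
# (classic Amdahl behaviour); an e that climbs with p points at overhead that grows with
# cluster size, such as all-to-all communication or synchronization barriers.
```

Run against the table's speedups, the implied fraction sits near 0.18 at 16 and 32 GPUs and nudges up to about 0.19 at 64, a result that should be read alongside the utilization and P95 columns before assigning a single cause.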