NimbusAI serves a text-generation API and is evaluating a new 7B-parameter model for production. The team tested the model on four hardware platforms, but leadership does not trust the results because throughput and latency vary materially by device, software stack, and batch settings.
| Platform | GPU/Accelerator | Batch Size | Precision (metric) | Accuracy | F1 Score | AUC-ROC | P50 Latency (ms) | Tokens/sec | Power (W) |
|---|---|---|---|---|---|---|---|---|---|
| A | NVIDIA A100 80GB | 16 | 0.88 | 0.91 | 0.84 | 0.94 | 118 | 1,920 | 285 |
| B | NVIDIA H100 80GB | 32 | 0.88 | 0.91 | 0.84 | 0.94 | 81 | 3,420 | 335 |
| C | TPU v5e | 64 | 0.87 | 0.90 | 0.82 | 0.93 | 74 | 3,980 | 410 |
| D | AMD MI300X | 32 | 0.88 | 0.91 | 0.83 | 0.94 | 96 | 2,760 | 360 |
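Even taken at face value, the headline numbers reward different platforms depending on how you normalize them. One derived view, computed directly from the table above, is energy efficiency in tokens per joule (tokens/sec divided by watts, since W = J/s); note that this normalization still inherits the mismatched batch sizes, so it illustrates rather than fixes the comparability problem. A minimal sketch:

```python
# Tokens per joule derived from the table above: (tokens/sec) / watts = tokens/J.
platforms = {
    "A (A100)":    (1920, 285),
    "B (H100)":    (3420, 335),
    "C (TPU v5e)": (3980, 410),
    "D (MI300X)":  (2760, 360),
}
for name, (tokens_per_sec, watts) in platforms.items():
    print(f"{name}: {tokens_per_sec / watts:.1f} tokens/J")
# A ≈ 6.7, B ≈ 10.2, C ≈ 9.7, D ≈ 7.7
```

By this measure the H100 and TPU v5e lead, but the ranking is only as trustworthy as the protocol that produced the underlying numbers.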
The model quality metrics are similar, but the benchmark is neither reproducible nor obviously fair: each platform used different batch sizes, kernels, quantization settings, and warm-up durations. Some runs excluded tokenization time from the latency figures; others reported only steady-state throughput. Your task is to redesign the benchmark so results can be compared credibly across hardware platforms.
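Much of the redesign comes down to pinning the protocol: identical batch size and decode length everywhere, a fixed untimed warm-up, end-to-end timing that includes tokenization, and percentile latency plus throughput measured over the same window on every platform. Below is a minimal harness sketch under those assumptions; `generate_fn`, `BenchmarkConfig`, and the specific field names are hypothetical illustrations, not part of any platform's API.

```python
import time
import statistics
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class BenchmarkConfig:
    """Settings that must be identical on every platform."""
    batch_size: int      # fixed across platforms, e.g. 16
    warmup_iters: int    # untimed iterations before measurement starts
    timed_iters: int     # iterations included in the reported numbers
    max_new_tokens: int  # identical decode length everywhere

@dataclass
class BenchmarkResult:
    p50_latency_ms: float
    p95_latency_ms: float
    tokens_per_sec: float

def run_benchmark(
    # Hypothetical interface: (prompts, max_new_tokens) -> tokens produced.
    # Tokenization must happen inside generate_fn so it is included in timing.
    generate_fn: Callable[[List[str], int], int],
    prompts: List[str],
    cfg: BenchmarkConfig,
) -> BenchmarkResult:
    """Time end-to-end generation, including tokenization, after a fixed warm-up."""
    batch = prompts[: cfg.batch_size]

    # Warm-up: same iteration count on every platform, never timed.
    for _ in range(cfg.warmup_iters):
        generate_fn(batch, cfg.max_new_tokens)

    latencies_ms: List[float] = []
    total_tokens = 0
    t_start = time.perf_counter()
    for _ in range(cfg.timed_iters):
        t0 = time.perf_counter()
        total_tokens += generate_fn(batch, cfg.max_new_tokens)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    wall_s = time.perf_counter() - t_start

    latencies_ms.sort()
    return BenchmarkResult(
        p50_latency_ms=statistics.median(latencies_ms),
        p95_latency_ms=latencies_ms[int(0.95 * (len(latencies_ms) - 1))],
        tokens_per_sec=total_tokens / wall_s,
    )
```

On each platform only the `generate_fn` binding changes; everything in `BenchmarkConfig` stays fixed and can be reported alongside the results, so any run can be rerun under exactly the conditions that produced it.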