Design a Fair Cross-Hardware Benchmark

Medium

Model Evaluation

Asked at 1 company

Accuracy, Precision, Recall

Problem

Context

NimbusAI serves a text-generation API and is evaluating a new 7B-parameter model for production. The team tested the model on four hardware platforms (A through D in the table under Current Performance), but leadership does not trust the results because throughput and latency vary materially by device, software stack, and batch settings.

Current Performance

Platform	GPU/Accelerator	Batch Size	Precision	Accuracy	F1 Score	AUC-ROC	P50 Latency (ms)	Tokens/sec	Power (W)
A	NVIDIA A100 80GB	16	0.88	0.91	0.84	0.94	118	1,920	285
B	NVIDIA H100 80GB	32	0.88	0.91	0.84	0.94	81	3,420	335
C	TPU v5e	64	0.87	0.90	0.82	0.93	74	3,980	410
D	AMD MI300X	32	0.88	0.91	0.83	0.94	96	2,760	360
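
One way to read the table before redesigning anything is to normalize throughput by power draw. The sketch below computes tokens per joule from the figures exactly as reported. The names ROWS, tokens_per_joule, and efficiency_table are illustrative, not part of any NimbusAI tooling. Because each row used a different batch size and the power measurement method is not stated, these ratios inherit the same fairness problems as the raw numbers and should be treated as a sanity check only.

```python
# Figures copied from the Current Performance table.
# (platform, accelerator, batch_size, p50_latency_ms, tokens_per_sec, power_w)
ROWS = [
    ("A", "NVIDIA A100 80GB", 16, 118, 1920, 285),
    ("B", "NVIDIA H100 80GB", 32, 81, 3420, 335),
    ("C", "TPU v5e", 64, 74, 3980, 410),
    ("D", "AMD MI300X", 32, 96, 2760, 360),
]


def tokens_per_joule(tokens_per_sec: float, power_w: float) -> float:
    """Tokens/sec divided by watts (joules/sec) gives tokens per joule."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return tokens_per_sec / power_w


def efficiency_table(rows=ROWS) -> dict:
    """Map platform label to tokens per joule, rounded to two decimals."""
    return {
        platform: round(tokens_per_joule(tps, watts), 2)
        for platform, _accel, _batch, _p50, tps, watts in rows
    }


if __name__ == "__main__":
    for platform, value in efficiency_table().items():
        print(f"{platform}: {value} tokens/J")
```

On the reported numbers, Platform C has the highest raw throughput but Platform B has the highest tokens per joule, which is one reason to report system metrics together rather than one at a time.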

The Problem

The model quality metrics are similar across platforms, but the benchmark is neither reproducible nor obviously fair: each platform used different batch sizes, kernels, quantization settings, and warm-up durations. Some runs excluded tokenization time; others reported steady-state throughput only. Your task is to redesign the benchmark so that results can be compared credibly across hardware platforms.
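
Two of the inconsistencies named above, warm-up duration and whether tokenization is timed, can be removed by fixing both in one shared harness. The following is a minimal sketch under stated assumptions: tokenize, generate, and detokenize are hypothetical callables supplied by each platform's serving stack, and the default of 10 warm-up requests is an arbitrary placeholder, not a recommended value.

```python
import time
from typing import Callable, List, Sequence


def measure_end_to_end(
    prompts: Sequence[str],
    tokenize: Callable[[str], List[int]],       # hypothetical platform hook
    generate: Callable[[List[int]], List[int]],  # hypothetical platform hook
    detokenize: Callable[[List[int]], str],     # hypothetical platform hook
    warmup_requests: int = 10,                  # assumed value, same on every platform
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Return per-request latencies in milliseconds, tokenization included."""
    if warmup_requests < 0 or warmup_requests >= len(prompts):
        raise ValueError("warmup_requests must leave at least one measured prompt")

    # Warm-up phase: identical request count everywhere, results discarded.
    for prompt in prompts[:warmup_requests]:
        detokenize(generate(tokenize(prompt)))

    # Measured phase: the timer wraps tokenize -> generate -> detokenize,
    # so no platform can report model-only time.
    latencies_ms = []
    for prompt in prompts[warmup_requests:]:
        start = clock()
        detokenize(generate(tokenize(prompt)))
        latencies_ms.append((clock() - start) * 1000.0)
    return latencies_ms
```

The clock is injectable only so the harness can be checked deterministically; real runs would keep the default time.perf_counter. An accelerator that executes asynchronously would also need its generate hook to block until the output is ready, otherwise the timer stops too early.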

Requirements

Identify which parts of the current benchmark make the comparison unfair or non-reproducible.
Define a benchmark protocol that standardizes workload, software, and measurement methodology.
Recommend which quality and system metrics should be reported together and why.
Explain how to handle unavoidable hardware-specific optimizations without biasing the comparison.
Propose a validation plan to ensure repeated runs produce stable results.

Constraints

NimbusAI must support both online inference (P95 latency < 150 ms) and batch summarization jobs.
The benchmark must be runnable by external partners with limited access to proprietary tooling.
Engineering can afford at most 5 repeated runs per platform per benchmark configuration.
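
The latency budget and the five-run cap can be combined into one stability check. The sketch below is one possible approach, not a prescribed method: it takes the P95 of each run with the nearest-rank definition, then reports the coefficient of variation across runs. P95_BUDGET_MS and MAX_RUNS come from the constraints above; the 5% cv_threshold and the function names are assumptions chosen for illustration.

```python
import statistics
from typing import Sequence

P95_BUDGET_MS = 150.0  # online-inference constraint
MAX_RUNS = 5           # repeated-run constraint


def percentile_nearest_rank(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; integer arithmetic avoids float rounding."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need non-empty values and 0 < pct <= 100")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def summarize_runs(runs: Sequence[Sequence[float]], cv_threshold: float = 0.05) -> dict:
    """Summarize per-run P95 latency (ms) for one platform and configuration.

    cv_threshold is an assumed stability cutoff, not a value from the source.
    """
    if not 2 <= len(runs) <= MAX_RUNS:
        raise ValueError(f"expected between 2 and {MAX_RUNS} runs")
    p95s = [percentile_nearest_rank(run, 95) for run in runs]
    cv = statistics.stdev(p95s) / statistics.mean(p95s)
    return {
        "p95_per_run_ms": p95s,
        "worst_p95_ms": max(p95s),
        "cv": cv,
        "stable": cv <= cv_threshold,
        # Judge the budget on the worst run, not the average of runs.
        "meets_budget": max(p95s) < P95_BUDGET_MS,
    }
```

With only five runs the coefficient of variation is a coarse signal, so a failed stability check is better read as a prompt to inspect the setup than as a statistical verdict.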
