NimbusAI serves a text-generation API and is evaluating a new 7B-parameter model for production. The team tested the model on four hardware platforms, but leadership does not trust the results because throughput and latency vary materially by device, software stack, and batch settings.
| Platform | GPU/Accelerator | Batch Size | Precision (metric) | Accuracy | F1 Score | AUC-ROC | P50 Latency (ms) | Tokens/sec | Power (W) |
|---|---|---|---|---|---|---|---|---|---|
| A | NVIDIA A100 80GB | 16 | 0.88 | 0.91 | 0.84 | 0.94 | 118 | 1,920 | 285 |
| B | NVIDIA H100 80GB | 32 | 0.88 | 0.91 | 0.84 | 0.94 | 81 | 3,420 | 335 |
| C | TPU v5e | 64 | 0.87 | 0.90 | 0.82 | 0.93 | 74 | 3,980 | 410 |
| D | AMD MI300X | 32 | 0.88 | 0.91 | 0.83 | 0.94 | 96 | 2,760 | 360 |
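Even taken at face value, the headline numbers reward different platforms depending on how you normalize them. One derived view, computed directly from the table above, is energy efficiency in tokens per joule (tokens/sec divided by watts, since W = J/s); note that this normalization still inherits the mismatched batch sizes, so it illustrates rather than fixes the comparability problem. A minimal sketch:

```python
# Tokens per joule derived from the table above: (tokens/sec) / watts = tokens/J.
platforms = {
    "A (A100)":    (1920, 285),
    "B (H100)":    (3420, 335),
    "C (TPU v5e)": (3980, 410),
    "D (MI300X)":  (2760, 360),
}
for name, (tokens_per_sec, watts) in platforms.items():
    print(f"{name}: {tokens_per_sec / watts:.1f} tokens/J")
# A ≈ 6.7, B ≈ 10.2, C ≈ 9.7, D ≈ 7.7
```

By this measure the H100 and TPU v5e lead, but the ranking is only as trustworthy as the protocol that produced the underlying numbers.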
The model quality metrics are similar, but the benchmark is neither reproducible nor obviously fair: each platform used different batch sizes, kernels, quantization settings, and warm-up durations. Some runs excluded tokenization time from the latency figures; others reported only steady-state throughput. Your task is to redesign the benchmark so results can be compared credibly across hardware platforms.
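Much of the redesign comes down to pinning the protocol: identical batch size and decode length everywhere, a fixed untimed warm-up, end-to-end timing that includes tokenization, and percentile latency plus throughput measured over the same window on every platform. Below is a minimal harness sketch under those assumptions; `generate_fn`, `BenchmarkConfig`, and the specific field names are hypothetical illustrations, not part of any platform's API.

```python
import time
import statistics
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class BenchmarkConfig:
    """Settings that must be identical on every platform."""
    batch_size: int      # fixed across platforms, e.g. 16
    warmup_iters: int    # untimed iterations before measurement starts
    timed_iters: int     # iterations included in the reported numbers
    max_new_tokens: int  # identical decode length everywhere

@dataclass
class BenchmarkResult:
    p50_latency_ms: float
    p95_latency_ms: float
    tokens_per_sec: float

def run_benchmark(
    # Hypothetical interface: (prompts, max_new_tokens) -> tokens produced.
    # Tokenization must happen inside generate_fn so it is included in timing.
    generate_fn: Callable[[List[str], int], int],
    prompts: List[str],
    cfg: BenchmarkConfig,
) -> BenchmarkResult:
    """Time end-to-end generation, including tokenization, after a fixed warm-up."""
    batch = prompts[: cfg.batch_size]

    # Warm-up: same iteration count on every platform, never timed.
    for _ in range(cfg.warmup_iters):
        generate_fn(batch, cfg.max_new_tokens)

    latencies_ms: List[float] = []
    total_tokens = 0
    t_start = time.perf_counter()
    for _ in range(cfg.timed_iters):
        t0 = time.perf_counter()
        total_tokens += generate_fn(batch, cfg.max_new_tokens)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    wall_s = time.perf_counter() - t_start

    latencies_ms.sort()
    return BenchmarkResult(
        p50_latency_ms=statistics.median(latencies_ms),
        p95_latency_ms=latencies_ms[int(0.95 * (len(latencies_ms) - 1))],
        tokens_per_sec=total_tokens / wall_s,
    )
```

On each platform only the `generate_fn` binding changes; everything in `BenchmarkConfig` stays fixed and can be reported alongside the results, so any run can be rerun under exactly the conditions that produced it.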