Design a Fair Cross-Hardware Benchmark

Medium
Model Evaluation
Topics: Accuracy, Precision, Recall
Asked at: NVIDIA

Problem

Context

NimbusAI serves a text-generation API and is evaluating a new 7B-parameter model for production. The team tested the model on four hardware platforms, but leadership does not trust the results because throughput and latency vary materially by device, software stack, and batch settings.

Current Performance

Platform  GPU/Accelerator   Batch Size  Precision  Accuracy  F1 Score  AUC-ROC  P50 Latency (ms)  Tokens/sec  Power (W)
A         NVIDIA A100 80GB  16          0.88       0.91      0.84      0.94     118               1,920       285
B         NVIDIA H100 80GB  32          0.88       0.91      0.84      0.94     81                3,420       335
C         TPU v5e           64          0.87       0.90      0.82      0.93     74                3,980       410
D         AMD MI300X        32          0.88       0.91      0.83      0.94     96                2,760       360
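
One derived metric worth reading off the table is energy efficiency: tokens per second divided by watts gives tokens per joule. The sketch below computes it from the reported figures only. It assumes the Power (W) column is average draw over the same window as Tokens/sec, which the source does not state, and because each row used a different batch size the resulting numbers are not a fair ranking of the hardware.

```python
# Figures copied from the Current Performance table.
# Assumption: Power (W) is average draw during the throughput measurement.
RESULTS = {
    "A (NVIDIA A100 80GB)": {"batch": 16, "tokens_per_sec": 1920, "power_w": 285},
    "B (NVIDIA H100 80GB)": {"batch": 32, "tokens_per_sec": 3420, "power_w": 335},
    "C (TPU v5e)": {"batch": 64, "tokens_per_sec": 3980, "power_w": 410},
    "D (AMD MI300X)": {"batch": 32, "tokens_per_sec": 2760, "power_w": 360},
}


def tokens_per_joule(tokens_per_sec: float, power_w: float) -> float:
    """Tokens/sec divided by watts; 1 W = 1 J/s, so the unit is tokens per joule."""
    return tokens_per_sec / power_w


EFFICIENCY = {
    name: tokens_per_joule(row["tokens_per_sec"], row["power_w"])
    for name, row in RESULTS.items()
}

for name, value in EFFICIENCY.items():
    # Batch size is printed next to each value as a reminder that the rows
    # were measured under different settings.
    print(f"{name}: batch={RESULTS[name]['batch']}, {value:.2f} tokens/J")
```

Platform C has the highest raw throughput but not the highest tokens per joule, which is one reason to report throughput, latency, and power together instead of a single headline number.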

The Problem

The model quality metrics are similar, but the benchmark is neither reproducible nor demonstrably fair: each platform used different batch sizes, kernels, quantization settings, and warm-up durations. Some runs excluded tokenization time; others reported steady-state throughput only. Your task is to redesign the benchmark so results can be compared credibly across hardware platforms.
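
Two of the inconsistencies named above, warm-up duration and whether tokenization is timed, can be removed by running the same measurement loop on every platform. The following is a minimal sketch, not a reference harness: `tokenize` and `generate` are hypothetical callables standing in for whatever each platform's stack provides, and `warmup_requests` is an illustrative parameter. On accelerators that execute asynchronously, `generate` would also need to block until the device has finished before the clock is read.

```python
import time
from typing import Callable, Sequence


def measure_end_to_end(
    tokenize: Callable[[str], Sequence[int]],          # hypothetical tokenizer
    generate: Callable[[Sequence[int]], Sequence[int]],  # hypothetical model call
    prompts: Sequence[str],
    warmup_requests: int,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time tokenization plus generation per request after a fixed warm-up."""
    if warmup_requests >= len(prompts):
        raise ValueError("need at least one measured request after warm-up")

    # Warm-up requests are executed but never timed, and the count is the
    # same on every platform.
    for prompt in prompts[:warmup_requests]:
        generate(tokenize(prompt))

    latencies_ms = []
    output_tokens = 0
    run_start = clock()
    for prompt in prompts[warmup_requests:]:
        start = clock()
        token_ids = tokenize(prompt)   # tokenization is inside the timed region
        output = generate(token_ids)
        latencies_ms.append((clock() - start) * 1000.0)
        output_tokens += len(output)
    elapsed_s = clock() - run_start

    return {
        "latencies_ms": latencies_ms,
        "output_tokens": output_tokens,
        "tokens_per_sec": output_tokens / elapsed_s if elapsed_s > 0 else float("inf"),
        "measured_requests": len(latencies_ms),
    }
```

Reporting the raw per-request latencies, not only an average, lets anyone recompute P50 or P95 from the same data.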

Requirements

  1. Identify which parts of the current benchmark make the comparison unfair or non-reproducible.
  2. Define a benchmark protocol that standardizes workload, software, and measurement methodology.
  3. Recommend which quality and system metrics should be reported together and why.
  4. Explain how to handle unavoidable hardware-specific optimizations without biasing the comparison.
  5. Propose a validation plan to ensure repeated runs produce stable results.
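
One way to make requirement 2 concrete is to pin the protocol in a single immutable record and publish a fingerprint of it alongside every result, so two runs can be compared only if their fingerprints match. The sketch below is illustrative: the class name, field names, and every value in the usage example are assumptions, not part of the original problem.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkProtocol:
    """Hypothetical record of the settings that must match across platforms."""
    model_id: str
    numeric_format: str          # e.g. the weight/activation format used
    prompt_set: str              # identifier of the fixed prompt workload
    max_new_tokens: int
    batch_sizes: tuple
    warmup_requests: int
    include_tokenization: bool
    repeats: int

    def fingerprint(self) -> str:
        """Short, stable hash of all fields for tagging result files."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
protocol = BenchmarkProtocol(
    model_id="example-7b",
    numeric_format="bf16",
    prompt_set="example-prompts-v1",
    max_new_tokens=128,
    batch_sizes=(16, 32, 64),
    warmup_requests=20,
    include_tokenization=True,
    repeats=5,
)
print(protocol.fingerprint())
```

Hardware-specific optimizations (requirement 4) would then be reported as a separately labeled run with its own fingerprint, not folded into the standardized one.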

Constraints

  • NimbusAI must support both online inference (P95 latency < 150 ms) and batch summarization jobs.
  • The benchmark must be runnable by external partners with limited access to proprietary tooling.
  • Engineering can afford at most 5 repeated runs per platform per benchmark configuration.
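
The first and third constraints can be checked mechanically once per-run data exists. The sketch below uses the 150 ms P95 limit and the cap of 5 runs from the constraints above; the nearest-rank percentile method and the 5% threshold on run-to-run variation (`max_cv`) are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
import statistics
from typing import Sequence

P95_LIMIT_MS = 150.0  # from the online-inference constraint
MAX_RUNS = 5          # from the repeated-runs constraint


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def summarize_runs(
    per_run_latencies_ms: Sequence[Sequence[float]],
    per_run_tokens_per_sec: Sequence[float],
    max_cv: float = 0.05,  # assumed stability threshold, not from the source
) -> dict:
    """Check the P95 limit on every run and the spread of throughput across runs."""
    n_runs = len(per_run_latencies_ms)
    if not 2 <= n_runs <= MAX_RUNS:
        raise ValueError(f"expected between 2 and {MAX_RUNS} runs")
    if len(per_run_tokens_per_sec) != n_runs:
        raise ValueError("latency and throughput lists must cover the same runs")

    p95_per_run = [percentile(run, 95) for run in per_run_latencies_ms]
    cv = coefficient_of_variation(per_run_tokens_per_sec)
    return {
        "p95_ms_per_run": p95_per_run,
        "worst_p95_ms": max(p95_per_run),
        "meets_online_limit": max(p95_per_run) < P95_LIMIT_MS,
        "throughput_cv": cv,
        "stable": cv <= max_cv,
    }
```

With at most five runs, confidence intervals are wide, so reporting the per-run values next to the summary is more informative than a mean alone.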
