Business Context
You’re the analytics lead supporting NVIDIA DGX-based GPU training infrastructure at a fintech company that runs real-time fraud models and large-batch offline retraining. The platform serves ~1,200 GPUs across 80 nodes in two regions, executes ~35,000 training jobs/day, and backs workloads that influence $4B/year in payment volume. A recent quarter saw multiple high-severity incidents: training pipelines missing SLAs, GPU utilization dropping unexpectedly, and a spike in job failures after a driver upgrade. Leadership wants a monitoring solution that goes beyond dashboards: a metric framework that reliably detects issues early, attributes root causes, and ties them to business impact.
Metric Scenario
Stakeholders (Infra Eng, ML Platform, and Finance) are asking:
- “Are we using GPUs efficiently, or are we wasting spend?”
- “When training slows down, is it due to GPU contention, data I/O, network, or scheduler behavior?”
- “How do we catch regressions within 15 minutes of a deploy?”
- “What should be the north-star KPI for the cluster, and what guardrails prevent gaming?”
You have access to system telemetry (GPU/CPU/memory), scheduler events (Kubernetes + Slurm-like queues), and job-level logs from the ML platform. You also have cost and capacity data from Finance.
Data Available
| Source | What it contains | Grain |
|---|---|---|
| gpu_telemetry | per-GPU utilization, memory used, power draw, temperature, ECC errors, throttling flags | GPU-minute |
| node_telemetry | CPU, RAM, disk I/O, local NVMe, kernel/driver versions, node health | Node-minute |
| network_telemetry | RDMA/InfiniBand throughput, packet loss, retransmits, latency | Link-minute |
| scheduler_events | job submitted/started/ended, queue, priority, preemption, node placement, retries | Event |
| job_runtime_metrics | step time, data loader time, checkpoint time, gradient sync time, framework version | Job-step |
| job_metadata | team, model type, dataset, requested GPUs, requested memory, container image | Job |
| cost_capacity | GPU-hour cost, reserved vs on-demand, cluster capacity by region | Day |
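As a rough illustration of how these sources line up at their stated grains, here is a minimal pandas sketch that joins GPU-minute telemetry to scheduler job intervals to get per-job average utilization. The file paths and column names (`gpu_util_pct`, `node_id`, `job_id`, `ts`, `event_type`) are assumptions for illustration, not a confirmed schema, and the node-level join assumes full-node allocations.

```python
import pandas as pd

gpu_telemetry = pd.read_parquet("gpu_telemetry.parquet")        # GPU-minute grain (assumed path)
scheduler_events = pd.read_parquet("scheduler_events.parquet")  # event grain (assumed path)

# Build one (job_id, node_id, start_ts, end_ts) interval per job placement.
starts = (scheduler_events[scheduler_events["event_type"] == "started"]
          .rename(columns={"ts": "start_ts"})[["job_id", "node_id", "start_ts"]])
ends = (scheduler_events[scheduler_events["event_type"] == "ended"]
        .rename(columns={"ts": "end_ts"})[["job_id", "end_ts"]])
intervals = starts.merge(ends, on="job_id", how="inner")

# Attach each GPU-minute sample to the job occupying that node at that minute
# (assumes full-node allocations, i.e. one job per node at a time).
samples = gpu_telemetry.merge(intervals, on="node_id", how="inner")
samples = samples[samples["ts"].between(samples["start_ts"], samples["end_ts"])]

# Average utilization per job -- the raw material for most efficiency KPIs.
per_job_util = samples.groupby("job_id")["gpu_util_pct"].mean()
print(per_job_util.describe())
```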
Your Task (what you must produce)
- Define a north-star metric for cluster “health” that balances reliability and efficiency. Explain why it is the right top-level KPI for this business.
- Specify 6–10 supporting KPIs (leading + lagging) with clear definitions, ownership (Infra vs ML Platform), and alerting intent (page vs ticket vs dashboard).
- Provide a metric decomposition for diagnosing a sudden drop in the north-star metric (e.g., -8% in 24 hours); a decomposition sketch follows this list. Your decomposition must isolate whether the cause is:
- hardware (ECC, thermal throttling),
- software (driver/framework regressions),
- scheduling/queuing (priority changes, fragmentation),
- data pipeline / I/O bottlenecks,
- network (collective comms degradation).
- Propose alert thresholds and baselines: what do you compare against (rolling 7-day, per-hardware SKU, per-workload class)? How do you avoid false positives during known seasonality (weekday retrains) or planned maintenance? A baseline sketch also follows this list.
- Recommend actions you would take given two example findings:
  - (A) GPU utilization is flat, but job-completion SLA misses spike
  - (B) GPU utilization drops and queue time increases, while node health looks normal
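For the decomposition item, the sketch below shows one layer of that attribution using the step-time components listed in job_runtime_metrics: compare today's mean per component against a 7-day baseline and report each component's share of the regression. Column names (`step_time_s`, `data_loader_s`, `grad_sync_s`, `checkpoint_s`, `step_ts`) are assumptions, not a confirmed schema.

```python
import pandas as pd

steps = pd.read_parquet("job_runtime_metrics.parquet")  # job-step grain (assumed path)

# Derive compute time as the residual of step time not explained by the
# measured components (data loading, gradient sync, checkpointing).
parts = ["data_loader_s", "grad_sync_s", "checkpoint_s"]
steps["compute_s"] = steps["step_time_s"] - steps[parts].sum(axis=1)

steps["day"] = steps["step_ts"].dt.floor("D")
daily = steps.groupby("day")[parts + ["compute_s"]].mean()

baseline = daily.iloc[-8:-1].mean()  # prior 7 days
today = daily.iloc[-1]
delta = today - baseline

# Share of the step-time regression explained by each component: data loader
# points at I/O, gradient sync at network, checkpoint at storage, and compute
# at hardware or a driver/framework regression.
print((delta / delta.sum()).sort_values(ascending=False))
```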
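A complementary sketch for the thresholds-and-baselines item, under the same caveat that all table and column names are illustrative: score the latest 5-minute value per hardware SKU against a rolling 7-day, hour-of-day robust baseline (median/MAD), with planned maintenance windows excluded so they neither page nor skew the baseline.

```python
import pandas as pd

metric = pd.read_parquet("metric_5min_by_sku.parquet")        # ts, sku, value (assumed)
maintenance = pd.read_parquet("maintenance_windows.parquet")  # start, end (assumed)

# Drop planned-maintenance intervals so they neither alert nor skew the baseline.
for _, w in maintenance.iterrows():
    metric = metric[~metric["ts"].between(w["start"], w["end"])]

# Robust baseline per SKU and hour-of-day over the trailing 7 days, which keeps
# known seasonality (e.g., weekday retrains) from tripping the threshold.
metric["hour"] = metric["ts"].dt.hour
recent = metric[metric["ts"] >= metric["ts"].max() - pd.Timedelta(days=7)]
baseline = (
    recent.groupby(["sku", "hour"])["value"]
    .agg(median="median", mad=lambda s: (s - s.median()).abs().median())
    .reset_index()
)

# Score the latest point per SKU with a robust z-score; page only on large drops,
# route smaller deviations to tickets or dashboards.
latest = metric.sort_values("ts").groupby("sku").tail(1)
latest = latest.merge(baseline, on=["sku", "hour"])
latest["robust_z"] = (latest["value"] - latest["median"]) / (latest["mad"] + 1e-9)
print(latest.loc[latest["robust_z"] < -5, ["sku", "ts", "value", "robust_z"]])
```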
Constraints:
- You must detect high-severity regressions within 15 minutes.
- Metrics must be robust to teams “gaming” utilization (e.g., running dummy workloads).
- The cluster supports both short interactive jobs and multi-day distributed training.