Business Context
You’re the analytics lead supporting NVIDIA DGX-based GPU training infrastructure at a fintech company that runs real-time fraud models and large-batch offline retraining. The platform serves ~1,200 GPUs across 80 nodes in two regions, executes ~35,000 training jobs/day, and backs workloads that influence $4B/year in payment volume. A recent quarter saw multiple high-severity incidents: training pipelines missing SLAs, GPU utilization dropping unexpectedly, and a spike in job failures after a driver upgrade. Leadership wants a monitoring solution that goes beyond dashboards: a metric framework that reliably detects issues early, attributes root causes, and ties them to business impact.
Metric Scenario
Stakeholders (Infra Eng, ML Platform, and Finance) are asking:
- “Are we using GPUs efficiently, or are we wasting spend?”
- “When training slows down, is it due to GPU contention, data I/O, network, or scheduler behavior?”
- “How do we catch regressions within 15 minutes of a deploy?”
- “What should be the north-star KPI for the cluster, and what guardrails prevent gaming?”
You have access to system telemetry (GPU/CPU/memory), scheduler events (Kubernetes + Slurm-like queues), and job-level logs from the ML platform. You also have cost and capacity data from Finance.
Data Available
| Source | What it contains | Grain |
|---|---|---|
| gpu_telemetry | per-GPU utilization, memory used, power draw, temperature, ECC errors, throttling flags | GPU-minute |
| node_telemetry | CPU, RAM, disk I/O, local NVMe, kernel/driver versions, node health | Node-minute |
| network_telemetry | RDMA/InfiniBand throughput, packet loss, retransmits, latency | Link-minute |
| scheduler_events | job submitted/started/ended, queue, priority, preemption, node placement, retries | Event |
| job_runtime_metrics | step time, data loader time, checkpoint time, gradient sync time, framework version | Job-step |
| job_metadata | team, model type, dataset, requested GPUs, requested memory, container image | Job |
| cost_capacity | GPU-hour cost, reserved vs on-demand, cluster capacity by region | Day |
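As a rough illustration of how these sources line up at their stated grains, here is a minimal pandas sketch that joins GPU-minute telemetry to scheduler job intervals to get per-job average utilization. The file paths and column names (`gpu_util_pct`, `node_id`, `job_id`, `ts`, `event_type`) are assumptions for illustration, not a confirmed schema, and the node-level join assumes full-node allocations.

```python
import pandas as pd

gpu_telemetry = pd.read_parquet("gpu_telemetry.parquet")        # GPU-minute grain (assumed path)
scheduler_events = pd.read_parquet("scheduler_events.parquet")  # event grain (assumed path)

# Build one (job_id, node_id, start_ts, end_ts) interval per job placement.
starts = (scheduler_events[scheduler_events["event_type"] == "started"]
          .rename(columns={"ts": "start_ts"})[["job_id", "node_id", "start_ts"]])
ends = (scheduler_events[scheduler_events["event_type"] == "ended"]
        .rename(columns={"ts": "end_ts"})[["job_id", "end_ts"]])
intervals = starts.merge(ends, on="job_id", how="inner")

# Attach each GPU-minute sample to the job occupying that node at that minute
# (assumes full-node allocations, i.e. one job per node at a time).
samples = gpu_telemetry.merge(intervals, on="node_id", how="inner")
samples = samples[samples["ts"].between(samples["start_ts"], samples["end_ts"])]

# Average utilization per job -- the raw material for most efficiency KPIs.
per_job_util = samples.groupby("job_id")["gpu_util_pct"].mean()
print(per_job_util.describe())
```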
Your Task (what you must produce)
- Define a north-star metric for cluster “health” that balances reliability and efficiency. Explain why it is the right top-level KPI for this business.
- Specify 6–10 supporting KPIs (leading + lagging) with clear definitions, ownership (Infra vs ML Platform), and alerting intent (page vs ticket vs dashboard).
- Provide a metric decomposition for diagnosing a sudden drop in the north-star metric (e.g., -8% in 24 hours); a decomposition sketch follows this list. Your decomposition must isolate whether the cause is:
- hardware (ECC, thermal throttling),
- software (driver/framework regressions),
- scheduling/queuing (priority changes, fragmentation),
- data pipeline / I/O bottlenecks,
- network (collective comms degradation).
- Propose alert thresholds and baselines: what do you compare against (rolling 7-day, per-hardware SKU, per-workload class)? How do you avoid false positives during known seasonality (weekday retrains) or planned maintenance? A baseline sketch also follows this list.
- Recommend actions you would take given two example findings:
  - (A) GPU utilization is flat, but job-completion SLA misses spike
  - (B) GPU utilization drops and queue time increases, while node health looks normal
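For the decomposition item, the sketch below shows one layer of that attribution using the step-time components listed in job_runtime_metrics: compare today's mean per component against a 7-day baseline and report each component's share of the regression. Column names (`step_time_s`, `data_loader_s`, `grad_sync_s`, `checkpoint_s`, `step_ts`) are assumptions, not a confirmed schema.

```python
import pandas as pd

steps = pd.read_parquet("job_runtime_metrics.parquet")  # job-step grain (assumed path)

# Derive compute time as the residual of step time not explained by the
# measured components (data loading, gradient sync, checkpointing).
parts = ["data_loader_s", "grad_sync_s", "checkpoint_s"]
steps["compute_s"] = steps["step_time_s"] - steps[parts].sum(axis=1)

steps["day"] = steps["step_ts"].dt.floor("D")
daily = steps.groupby("day")[parts + ["compute_s"]].mean()

baseline = daily.iloc[-8:-1].mean()  # prior 7 days
today = daily.iloc[-1]
delta = today - baseline

# Share of the step-time regression explained by each component: data loader
# points at I/O, gradient sync at network, checkpoint at storage, and compute
# at hardware or a driver/framework regression.
print((delta / delta.sum()).sort_values(ascending=False))
```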
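A complementary sketch for the thresholds-and-baselines item, under the same caveat that all table and column names are illustrative: score the latest 5-minute value per hardware SKU against a rolling 7-day, hour-of-day robust baseline (median/MAD), with planned maintenance windows excluded so they neither page nor skew the baseline.

```python
import pandas as pd

metric = pd.read_parquet("metric_5min_by_sku.parquet")        # ts, sku, value (assumed)
maintenance = pd.read_parquet("maintenance_windows.parquet")  # start, end (assumed)

# Drop planned-maintenance intervals so they neither alert nor skew the baseline.
for _, w in maintenance.iterrows():
    metric = metric[~metric["ts"].between(w["start"], w["end"])]

# Robust baseline per SKU and hour-of-day over the trailing 7 days, which keeps
# known seasonality (e.g., weekday retrains) from tripping the threshold.
metric["hour"] = metric["ts"].dt.hour
recent = metric[metric["ts"] >= metric["ts"].max() - pd.Timedelta(days=7)]
baseline = (
    recent.groupby(["sku", "hour"])["value"]
    .agg(median="median", mad=lambda s: (s - s.median()).abs().median())
    .reset_index()
)

# Score the latest point per SKU with a robust z-score; page only on large drops,
# route smaller deviations to tickets or dashboards.
latest = metric.sort_values("ts").groupby("sku").tail(1)
latest = latest.merge(baseline, on=["sku", "hour"])
latest["robust_z"] = (latest["value"] - latest["median"]) / (latest["mad"] + 1e-9)
print(latest.loc[latest["robust_z"] < -5, ["sku", "ts", "value", "robust_z"]])
```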
Constraints:
- You must detect high-severity regressions within 15 minutes.
- Metrics must be robust to teams “gaming” utilization (e.g., running dummy workloads).
- The cluster supports both short interactive jobs and multi-day distributed training.