Last9 GPU Telemetry (l9gpu) provides full-stack observability across 8 layers, from silicon to business metrics. It is vendor-agnostic (NVIDIA, AMD, Intel Gaudi), runs on Kubernetes and Slurm, and collects from five inference engines: vLLM, NVIDIA NIM, Triton, SGLang, and TGI.
What's happening inside each GPU right now?
Sources: NVML API (NVIDIA), amdsmi (AMD), hl-smi (Intel Gaudi), DCGM Exporter
| Category | Metrics | Why It Matters |
|---|---|---|
| Compute | GPU utilization, SM active ratio, SM occupancy, tensor core activity, FP16/FP32/FP64 pipe activity | Distinguish between "GPU busy" and "GPU doing useful work" — utilization alone is misleading |
| Memory | VRAM used/free/total, HBM bandwidth saturation, memory controller utilization | Catch OOM before it crashes your job. Know when HBM bandwidth is the bottleneck |
| Interconnect | NVLink TX/RX throughput, PCIe TX/RX throughput, XGMI per-link bandwidth (AMD), per-port RoCE bandwidth (Gaudi) | Multi-GPU bottleneck detection. Is NCCL limited by NVLink or PCIe? |
| Power & Thermal | Power draw (W), temperature (edge/junction/HBM), clock frequency, throttle reasons, fan speed, energy consumption (mJ) | Detect thermal throttling before it impacts latency. Energy attribution for cost and carbon |
| Reliability | ECC errors (correctable/uncorrectable), XID errors, retired pages, row remapping, PCIe replay counter | Silent degradation detection. Rising ECC error trends can signal failure 48-72 hours ahead |
Fleet Health Signals (derived from L1):
| Signal | What It Detects | Alert Threshold |
|---|---|---|
| `gpu.health.score` (0-100) | Composite: ECC + XID + thermal + PCIe | Warning < 80, Critical < 50 |
| `gpu.ecc.sbe_rate` | Single-bit error rate trending up | > 10/hour |
| `gpu.xid.error_rate` | XID event frequency (XID 79 = GPU fell off bus) | > 0/hour |
| `gpu.pcie.link.downtraining` | PCIe Gen5 x16 → Gen3 x8 (massive bandwidth loss) | Any occurrence |
| `gpu.thermal.ramp_rate` | Temperature rising > 2 C/min (cooling failure) | > 2.0 C/min |
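As an illustration of how a derived signal such as `gpu.thermal.ramp_rate` could be computed, here is a minimal sketch. The function name, the sample format, and the first-to-last-sample method are assumptions made for this example, not l9gpu's actual implementation.

```python
from typing import Sequence, Tuple

RAMP_ALERT_C_PER_MIN = 2.0  # alert threshold from the table above


def thermal_ramp_rate(samples: Sequence[Tuple[float, float]]) -> float:
    """Return the temperature change in C/min across a sample window.

    samples: (unix_seconds, temp_celsius) pairs ordered by time.
    Hypothetical helper: compares only the first and last sample.
    """
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed_min = (t1 - t0) / 60.0
    if elapsed_min <= 0:
        return 0.0
    return (c1 - c0) / elapsed_min


def is_cooling_failure(samples: Sequence[Tuple[float, float]]) -> bool:
    """True when the ramp rate exceeds the alert threshold."""
    return thermal_ramp_rate(samples) > RAMP_ALERT_C_PER_MIN
```

A window of two minutes that goes from 60 C to 66 C yields 3.0 C/min and would trip the alert.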
MIG Support: Per-MIG-instance metrics with gpu.mig.instance_id attribution. Automatic fallback to GR_ENGINE_ACTIVE when standard utilization returns 0 on MIG-enabled GPUs.
Unified Memory (GH200/GB200): Automatic detection and reporting of unified CPU+GPU memory pools on Grace-Hopper and Grace-Blackwell architectures.
What are the GPUs actually executing? Where is multi-GPU communication stuck?
Sources: NCCL Inspector (production-ready log parser), xpu-perf eBPF+CUPTI profiler
| Category | Metrics | Why It Matters |
|---|---|---|
| NCCL Collectives | AllReduce/AllGather/ReduceScatter bandwidth, bus bandwidth, duration, message size | Is distributed training limited by communication? Which collective is the bottleneck? |
| Straggler Detection | Per-rank duration vs median. Flags any rank more than 1 s behind the median | One slow GPU drags down the entire training job |
| CUDA Kernels | Per-kernel execution time, call count, p99 duration | Which kernels dominate GPU time? Trace-level profiling via OTel spans |
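The straggler rule in the table (per-rank duration compared against the median) can be sketched as follows. The function name and input shape are hypothetical; only the median comparison and the 1 s threshold come from the text.

```python
from statistics import median
from typing import Dict, List


def find_stragglers(durations_s: Dict[int, float],
                    threshold_s: float = 1.0) -> List[int]:
    """Return ranks whose collective duration exceeds the median
    across ranks by more than threshold_s seconds."""
    if not durations_s:
        return []
    med = median(durations_s.values())
    return sorted(rank for rank, d in durations_s.items()
                  if d - med > threshold_s)


# Rank 3 takes 1.90 s while the others finish in about 0.5 s.
stragglers = find_stragglers({0: 0.50, 1: 0.52, 2: 0.51, 3: 1.90})
```

Using the median rather than the mean keeps a single slow rank from shifting the baseline it is compared against.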
Is the host the bottleneck, not the GPU?
Sources: OTel Collector hostmetrics receiver (shipped as Helm ConfigMap)
| Category | Metrics | Why It Matters |
|---|---|---|
| CPU | Utilization per core, I/O wait, context switches | High I/O wait = DataLoader bottleneck during prefill |
| Memory | Usage by state, available, swap usage | Any swap on a GPU node = immediate problem |
| Disk | Read/write bytes, I/O time, operations | Model loading and checkpointing throughput |
| Network | TX/RX bytes, errors, dropped packets, TCP retransmits | TCP retransmits = tensor parallelism degradation |
| Process | Per-process CPU, memory RSS, thread count, GC pauses | Isolate which serving process is misbehaving |
Which pod owns which GPU? What's the K8s context?
Sources: Custom Go OTel processor (k8sprocessor), kubeletstats receiver, k8s_cluster receiver
| Category | Attributes Enriched | Why It Matters |
|---|---|---|
| GPU-to-Pod mapping | `k8s.pod.name`, `k8s.namespace.name`, `k8s.container.name` | Every GPU metric automatically tagged with the pod using it |
| Workload owners | `k8s.deployment.name`, `k8s.job.name`, `k8s.statefulset.name` | Roll up GPU metrics by deployment or training job |
| Cloud topology | `cloud.availability_zone`, `cloud.region` | Regional cost and performance analysis |
| Pod labels | `app`, `app.kubernetes.io/*` (configurable) | Custom grouping by team, model, experiment |
| Container health | CPU/memory usage, OOM kills, restarts | Detect container-level issues separate from GPU issues |
| Slurm jobs | Job ID, user, partition, GPU allocation | Full Slurm job correlation for HPC clusters |
How it works: The k8sprocessor queries the K8s API for running pods on each node, maps GPU ordinals to pods by counting nvidia.com/gpu, amd.com/gpu, and habana.ai/gaudi resource requests, then injects pod metadata as data-point attributes on every GPU metric. 60-second cache for efficiency.
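The counting step described above might look roughly like the sketch below. It is written in Python for readability (the processor itself is Go), it uses plain dictionaries instead of a Kubernetes client, and it assumes ordinals are handed out in pod-list order, which is an illustrative simplification rather than confirmed behavior.

```python
from typing import Dict, List

GPU_RESOURCES = ("nvidia.com/gpu", "amd.com/gpu", "habana.ai/gaudi")


def map_ordinals_to_pods(pods: List[dict]) -> Dict[int, str]:
    """Assign consecutive GPU ordinals to pods by summing their GPU requests.

    Each pod dict is a hypothetical, simplified shape:
    {"namespace": str, "name": str, "containers": [{"requests": {...}}]}
    """
    mapping: Dict[int, str] = {}
    next_ordinal = 0
    for pod in pods:
        gpu_count = 0
        for container in pod.get("containers", []):
            requests = container.get("requests", {})
            gpu_count += sum(int(requests.get(r, 0)) for r in GPU_RESOURCES)
        for _ in range(gpu_count):
            mapping[next_ordinal] = f'{pod["namespace"]}/{pod["name"]}'
            next_ordinal += 1
    return mapping


pods = [
    {"namespace": "ml", "name": "train-0",
     "containers": [{"requests": {"nvidia.com/gpu": "2"}}]},
    {"namespace": "serve", "name": "vllm-0",
     "containers": [{"requests": {"nvidia.com/gpu": 1}}]},
]
ordinal_to_pod = map_ordinals_to_pods(pods)
```

The resulting mapping is what would then be attached to each GPU data point as `k8s.pod.name` and `k8s.namespace.name`.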
How efficient is your training job? Where are the bottlenecks?
Sources: l9gpu.training Python library (PyTorch hooks — import in your training script)
| Category | Metrics | Why It Matters |
|---|---|---|
| Compute Efficiency | MFU (Model FLOPs Utilization), achieved TFLOPS, step time | LLMs typically achieve 35-55% MFU. Below 30% = something is wrong |
| Gradient Health | L2 norm, NaN/Inf count, clipping rate | Gradient spikes predict training instability. NaN = immediate intervention |
| Loss | Training loss value per step | Loss curve anomaly detection |
| DataLoader | Time blocked waiting for next batch | High wait = CPU/IO is the bottleneck, not GPU |
| Checkpoint I/O | Save/restore duration, save bandwidth (bytes/s) | Checkpointing can dominate training time at scale |
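For intuition on the MFU figure, here is a minimal sketch using the common 6 × parameters × tokens approximation for the forward and backward FLOPs of a dense transformer. Whether l9gpu.training uses exactly this formula is an assumption; the function name and the 0.5 s step time are made up for the example.

```python
def model_flops_utilization(num_params: float, tokens_per_step: float,
                            step_time_s: float, gpu_count: int,
                            peak_tflops_per_gpu: float):
    """Return (achieved_tflops, mfu) for one training step.

    Assumes ~6 FLOPs per parameter per token (forward + backward),
    a standard approximation for dense transformers.
    """
    flops_per_step = 6.0 * num_params * tokens_per_step
    achieved_tflops = flops_per_step / step_time_s / 1e12
    peak_tflops = gpu_count * peak_tflops_per_gpu
    return achieved_tflops, achieved_tflops / peak_tflops


achieved, mfu = model_flops_utilization(
    num_params=70_000_000_000, tokens_per_step=4096,
    step_time_s=0.5,  # hypothetical measured step time
    gpu_count=8, peak_tflops_per_gpu=989.0,
)
```

With these inputs the achieved throughput is about 3440 TFLOPS against a peak of 7912 TFLOPS, an MFU of roughly 43%, which falls inside the typical range quoted in the table.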
```python
# 4 lines to instrument any PyTorch training loop
from l9gpu.training import L9GPUTrainingMonitor

monitor = L9GPUTrainingMonitor(
    otlp_endpoint="http://otel-collector:4317",
    num_params=70_000_000_000, tokens_per_step=4096,
    gpu_count=8, peak_tflops_per_gpu=989.0,
)
```

How is your LLM serving performing? Where is latency coming from?
Sources: Prometheus endpoints from 5 inference engines
| Engine | Endpoint | Metrics Collected |
|---|---|---|
| vLLM | `:8000/metrics` | 24 fields: ITL, TTFT, prefill/decode split, cache hit rate, spec decode, preemptions, LoRA |
| NVIDIA NIM | `:8000/metrics` | 9 fields: latency, batch size, queue depth, KV cache, ITL |
| Triton | `:8002/metrics` | 11 fields: per-model latency breakdown (request/queue/compute), batch efficiency |
| SGLang | `:30000/metrics` | 18 fields: throughput, TTFT, ITL, RadixAttention cache, queue depths |
| TGI | `:8080/metrics` | 20 fields: request/queue/inference latency, TPOT, batch size, token distributions |
| Metric | What It Measures | SLO Target |
|---|---|---|
| TTFT (Time to First Token) | User-perceived responsiveness | P99 < 500ms (interactive), < 2s (batch) |
| ITL (Inter-Token Latency) | Streaming smoothness | P99 < 100ms |
| Prefill Duration | Prompt processing time | Proportional to input length |
| Decode Duration | Token generation time | Proportional to output length |
| Queue Wait | Time waiting before inference starts | < 100ms at target load |
| E2E Latency | Total request duration | P95 < 2s (chat), < 10s (long-form) |
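To make the relationship between TTFT, ITL, and E2E latency concrete, the sketch below derives all three from token arrival timestamps for a single streamed request. The function name and input format are hypothetical; the inference engines export these values directly, so this is only for intuition.

```python
from typing import List, Tuple


def latency_breakdown(request_start_s: float,
                      token_times_s: List[float]) -> Tuple[float, List[float], float]:
    """Return (ttft, inter_token_latencies, e2e) in seconds.

    token_times_s: arrival time of each output token, in order.
    """
    if not token_times_s:
        raise ValueError("at least one output token is required")
    ttft = token_times_s[0] - request_start_s
    itl = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    e2e = token_times_s[-1] - request_start_s
    return ttft, itl, e2e


ttft, itl, e2e = latency_breakdown(0.0, [0.25, 0.5, 0.75])
```

Here the first token arrives after 0.25 s (TTFT), each following token arrives 0.25 s after the previous one (ITL), and the request completes after 0.75 s (E2E).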
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| `*.cache.usage` | KV cache block utilization | Warning > 80%, Critical > 92% |
| `*.cache.hit_rate` | Prefix cache reuse (vLLM, SGLang) | Low = wasted prefill compute |
| `*.cache.evictions` | Cache pressure | > 0/min = approaching capacity |
| Metric | What It Tells You |
|---|---|
| `*.scheduler.preemptions` | Continuous batching evicting requests (> 10/min = critical) |
| `*.spec_decode.acceptance_rate` | Speculative decoding draft token acceptance |
| `*.spec_decode.efficiency` | Mean accepted tokens per draft (higher = better) |
| `*.lora.active_count` | Number of loaded LoRA adapters |
| `*.requests.finished` by `finish_reason` | Stop vs length vs abort breakdown |
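The two speculative decoding metrics measure different things, which a small sketch makes clearer. The counter names below are hypothetical and the exact definitions vary by engine; this assumes acceptance rate is accepted tokens over proposed tokens, and efficiency is accepted tokens per draft round.

```python
from typing import Tuple


def spec_decode_stats(num_drafts: int, draft_tokens: int,
                      accepted_tokens: int) -> Tuple[float, float]:
    """Return (acceptance_rate, efficiency).

    acceptance_rate: fraction of proposed draft tokens that were accepted.
    efficiency: mean accepted tokens per draft round.
    """
    acceptance_rate = accepted_tokens / draft_tokens if draft_tokens else 0.0
    efficiency = accepted_tokens / num_drafts if num_drafts else 0.0
    return acceptance_rate, efficiency


# 100 draft rounds proposing 4 tokens each, 300 tokens accepted in total.
rate, efficiency = spec_decode_stats(num_drafts=100, draft_tokens=400,
                                     accepted_tokens=300)
```

In this example 75% of proposed tokens are accepted, which works out to 3 accepted tokens per draft round.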
Standard LLM observability attributes across any provider.
Sources: OTel GenAI semantic conventions (opt-in --emit-genai-namespace)
When `--emit-genai-namespace` is set, all inference metrics are emitted under the OpenTelemetry gen_ai.* namespace in addition to their engine-specific names. This enables multi-vendor dashboards.
| gen_ai.* Metric | Maps From | Attribute |
|---|---|---|
| `gen_ai.client.token.usage` | throughput from any engine | `gen_ai.token.type=input/output` |
| `gen_ai.server.request.duration` | e2e latency from any engine | `quantile=p50/p95/p99` |
| `gen_ai.server.time_to_first_token` | TTFT from any engine | `quantile=p50/p95` |
| `gen_ai.server.time_per_output_token` | ITL from any engine | `quantile=p50/p95` |
| `gen_ai.server.cache.utilization` | KV cache usage | `gen_ai.cache.type=gpu/cpu` |
| `gen_ai.provider.name` | resource attribute | `vllm`, `nvidia_nim`, `triton`, `sglang`, `huggingface_tgi` |
What does this GPU time actually cost? Is it being used efficiently?
Sources: cost_monitor (combines GPU power + cloud pricing + inference throughput)
| Metric | Unit | How It's Calculated |
|---|---|---|
| `gpu.cost.per_gpu_hour` | USD/h | Auto-detected from EC2 instance type (IMDSv2) or configured |
| `gpu.cost.per_prompt_token` | USD/token | cost_rate / prompt_tokens_per_sec |
| `gpu.cost.per_generation_token` | USD/token | cost_rate / generation_tokens_per_sec |
| `gpu.cost.idle_rate` | USD/s | Cost accruing when GPU utilization < 5% |
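The per-token formulas only yield USD/token if `cost_rate` is expressed per second. A minimal sketch of the arithmetic, assuming `cost_rate` is the hourly price divided by 3600 (the function name and that conversion are assumptions for illustration):

```python
from typing import Optional


def cost_per_token(usd_per_gpu_hour: float,
                   tokens_per_sec: float) -> Optional[float]:
    """Return USD per token, or None when there is no throughput.

    cost_rate (USD/s) = usd_per_gpu_hour / 3600
    cost per token    = cost_rate / tokens_per_sec
    """
    if tokens_per_sec <= 0:
        return None  # undefined while the GPU produces no tokens
    cost_rate_usd_per_s = usd_per_gpu_hour / 3600.0
    return cost_rate_usd_per_s / tokens_per_sec


# A GPU priced at 3.60 USD/h generating 1000 tokens/s.
usd_per_token = cost_per_token(3.6, 1000.0)
```

At 3.60 USD/h and 1000 tokens/s, each token costs about one millionth of a dollar, or roughly 1 USD per million tokens.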
| Metric | Unit | What It Tracks |
|---|---|---|
| `gpu.efficiency.tokens_per_watt` | tokens/W | Higher = better. Optimize for inference efficiency |
| `gpu.efficiency.joules_per_token` | J/token | Lower = better. Energy cost per token |
| `gpu.energy.co2_rate` | g/s | CO2 emission rate (configurable grid intensity + PUE) |
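The efficiency metrics follow from power draw and throughput. The sketch below shows one plausible way to derive them; the function name, the default grid intensity of 400 g CO2/kWh, and the default PUE of 1.2 are placeholder assumptions, not l9gpu defaults.

```python
def energy_metrics(power_w: float, tokens_per_sec: float,
                   grid_g_co2_per_kwh: float = 400.0, pue: float = 1.2) -> dict:
    """Derive efficiency and emission figures from power and throughput.

    power_w is in watts (joules per second), so dividing by tokens/s
    gives joules per token.
    """
    joules_per_token = power_w / tokens_per_sec if tokens_per_sec > 0 else None
    tokens_per_watt = tokens_per_sec / power_w if power_w > 0 else None
    # Facility power in kW, times grams per kWh, converted from per hour to per second.
    co2_g_per_s = (power_w * pue / 1000.0) * grid_g_co2_per_kwh / 3600.0
    return {
        "joules_per_token": joules_per_token,
        "tokens_per_watt": tokens_per_watt,
        "co2_g_per_s": co2_g_per_s,
    }


metrics = energy_metrics(power_w=700.0, tokens_per_sec=1400.0)
```

A 700 W GPU producing 1400 tokens/s spends 0.5 J per token; with the placeholder grid intensity and PUE it emits about 0.093 g of CO2 per second.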
| Metric | Warning | Critical |
|---|---|---|
| GPU temperature | > 80 C | > 90 C |
| VRAM usage | > 85% | > 95% |
| ECC double-bit errors | > 0 | > 0 (page immediately) |
| KV cache usage | > 80% | > 92% |
| TTFT P99 | > 1s | > 3s |
| ITL P99 | > 100ms | > 250ms |
| Scheduler preemptions | > 0/min | > 10/min |
| Health score | < 80 | < 50 |
| Idle GPU cost | > 10% of GPU-hours | > 25% of GPU-hours |
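A threshold table like this maps naturally onto a small lookup. The sketch below encodes a few of the rows and classifies a reading; the metric keys and the function are hypothetical and are not part of l9gpu's alerting configuration.

```python
# (warning, critical, direction) taken from the table above.
THRESHOLDS = {
    "gpu_temperature_c": (80.0, 90.0, "above"),
    "vram_usage_pct": (85.0, 95.0, "above"),
    "kv_cache_usage_pct": (80.0, 92.0, "above"),
    "health_score": (80.0, 50.0, "below"),
}


def severity(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading."""
    warn, crit, direction = THRESHOLDS[metric]
    if direction == "above":
        if value > crit:
            return "critical"
        if value > warn:
            return "warning"
    else:  # lower values are worse, e.g. health score
        if value < crit:
            return "critical"
        if value < warn:
            return "warning"
    return "ok"
```

For example, a GPU at 85 C is a warning, while a health score of 45 is critical because that metric alerts when it drops below its thresholds.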
| Symptom | Check L1 | Check L3 | Check L6 | Root Cause |
|---|---|---|---|---|
| High TTFT | SM_OCCUPANCY low | CPU I/O wait high | Queue depth high | CPU bottleneck during prefill |
| Latency spike | GPU_UTIL normal | Memory OK | KV cache > 85% | KV cache pressure causing preemptions |
| Low throughput | TENSOR_ACTIVE low | Network retransmits | Batch size small | NCCL bottleneck (multi-GPU) |
| OOM crash | VRAM at limit | — | `errors{type=oom}` | Model too large or batch too big |
| Throttling | CLOCK_THROTTLE set | — | — | Thermal or power cap hit |
| Silent degradation | ECC_DBE increasing | — | — | Failing GPU memory. Retire ASAP |
| Training stall | All GPUs idle | — | — | NCCL straggler on one rank |