WaffleBits/triton-inference-benchmark

Triton Inference Benchmark

Benchmark harness for Triton-style model serving. The tool supports a dependency-free mock mode for CI and an optional live HTTP mode for testing a real inference server endpoint.

Why This Exists

Inference infrastructure work is not just "run a model." Strong systems need repeatable benchmarks, clear latency percentiles, failure accounting, configurable concurrency, and results that can be compared over time. This repo is a small, reviewable implementation of that workflow.

Features

  • Concurrent load generation with configurable request count and worker count.
  • Retry-aware request execution.
  • Latency metrics: average, p50, p95, p99, min, max.
  • Throughput and success-rate reporting.
  • Dependency-free mock mode for CI and reviewer demos.
  • Optional Triton HTTP mode for live model-serving benchmarks.
  • JSON output for trend tracking and regression analysis.
  • Prometheus text export for dashboard or CI artifact ingestion.
  • Baseline-versus-candidate comparison with configurable p95 and success-rate gates.
  • Correlated Triton/DCGM telemetry snapshots for GPU utilization, memory, queue, and server-duration context.
  • Batch-invariance probes that compare exact output fingerprints in isolation and under concurrent noise traffic.
  • Token-throughput and cost-to-serve estimates with explicit GPU price, power, and workload assumptions.
  • Kubernetes Job example for cluster-local benchmark runs.
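
The load-generation, retry, throughput, and success-rate features above can be sketched with a thread pool. This is a minimal illustration, not the code in benchmark.py: the names `call_with_retries`, `run_load`, and `mock_send` are hypothetical, and counting only successful requests toward throughput is an assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(send, max_retries=2):
    """Call send() until it succeeds or retries run out.

    Returns (ok, latency_ms, attempts). Latency covers all attempts.
    """
    start = time.perf_counter()
    attempts = 0
    ok = False
    while attempts <= max_retries and not ok:
        attempts += 1
        try:
            send()
            ok = True
        except Exception:
            ok = False
    latency_ms = (time.perf_counter() - start) * 1000.0
    return ok, latency_ms, attempts


def run_load(send, num_requests, concurrency, max_retries=2):
    """Issue num_requests calls across `concurrency` worker threads."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(
            pool.map(lambda _: call_with_retries(send, max_retries),
                     range(num_requests))
        )
    duration_s = time.perf_counter() - wall_start
    successes = sum(1 for ok, _, _ in results if ok)
    return {
        "num_requests": num_requests,
        "successful_requests": successes,
        "failed_requests": num_requests - successes,
        "success_rate": successes / num_requests if num_requests else 0.0,
        # Assumption: throughput counts successful requests only.
        "throughput_rps": successes / duration_s if duration_s > 0 else 0.0,
        "latencies_ms": [lat for ok, lat, _ in results if ok],
    }


def mock_send():
    """Stand-in for an inference call; sleeps briefly instead of hitting a server."""
    time.sleep(0.001)


summary = run_load(mock_send, num_requests=20, concurrency=4)
```

Measuring latency across all attempts means a retried request shows up as a slow request rather than disappearing from the tail, which keeps the percentiles honest about what a caller would have experienced.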

Engineering Scope

This repo focuses on repeatable load generation, concurrency controls, retry accounting, percentile latency, throughput, and machine-readable output that can feed regression tracking.

Relevant areas:

  • AI infrastructure: model-serving reliability, latency analysis, failure accounting, and benchmark methodology.
  • Platform engineering: CLI design, JSON artifacts, CI-friendly mock mode, and extension points for live services.
  • Performance engineering: percentile metrics, concurrency sweeps, throughput measurement, and reproducible comparison paths.
  • Infrastructure/SRE: Prometheus-compatible benchmark artifacts, correlated server telemetry, release regression checks, Kubernetes job posture, and operations notes.

Reviewer Fast Path

  • Start with benchmark.py for benchmark orchestration and CLI behavior.
  • Review tests/ for metric and execution coverage.
  • Read DESIGN.md for benchmark tradeoffs and production extensions.
  • Read docs/OPERATIONS.md for regression triage, SLO-oriented checks, and Prometheus export usage.
  • Review sample_results/mock_telemetry.prom for the synthetic Triton/DCGM telemetry fixture.
  • Review deploy/kubernetes/benchmark-job.yaml for the cluster-run shape.
  • Read docs/PORTFOLIO_REVIEW.md for the technical review guide.

Quick Start

Run a local mock benchmark without GPU dependencies:

python benchmark.py --mode mock --num-requests 100 --concurrency 8

Write JSON plus Prometheus text-format artifacts:

python benchmark.py --mode mock --num-requests 500 --concurrency 32 --prometheus
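
For orientation, a result dictionary can be rendered in the Prometheus text exposition format roughly as follows. The metric names, the `benchmark` prefix, and the `to_prometheus_text` helper are assumptions made for this sketch; the names the tool actually emits may differ.

```python
def to_prometheus_text(result, prefix="benchmark"):
    """Render a benchmark result dict as Prometheus text-format gauges."""
    mode = result["mode"]
    lines = []

    # Scalar metrics become one gauge each, labeled with the run mode.
    for name in ("success_rate", "throughput_rps"):
        metric = f"{prefix}_{name}"
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f'{metric}{{mode="{mode}"}} {result[name]}')

    # Latency statistics share one metric name and differ by the `stat` label.
    metric = f"{prefix}_latency_ms"
    lines.append(f"# TYPE {metric} gauge")
    for stat, value in result["latency_ms"].items():
        lines.append(f'{metric}{{mode="{mode}",stat="{stat}"}} {value}')

    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


example_result = {
    "mode": "mock",
    "success_rate": 1.0,
    "throughput_rps": 305.42,
    "latency_ms": {"p50": 21.74, "p95": 33.11},
}
prom_text = to_prometheus_text(example_result)
```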

Attach a synthetic Triton/DCGM telemetry snapshot to the benchmark result:

python benchmark.py \
  --mode mock \
  --num-requests 500 \
  --concurrency 32 \
  --telemetry-prometheus sample_results/mock_telemetry.prom \
  --prometheus
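
A telemetry snapshot in Prometheus text format can be read with a small line parser. The sketch below is an assumption about how such a file might be consumed, not the tool's parser: `parse_prometheus_text` is a hypothetical helper, the sample lines are illustrative rather than copied from sample_results/mock_telemetry.prom, and the parser ignores timestamps and label values that contain commas.

```python
def parse_prometheus_text(text):
    """Parse simple 'name{labels} value' lines into (name, labels, value) tuples."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        head, _, value = line.rpartition(" ")
        name, _, label_part = head.partition("{")
        labels = {}
        for pair in label_part.rstrip("}").split(","):
            if "=" in pair:
                key, _, val = pair.partition("=")
                labels[key.strip()] = val.strip().strip('"')
        samples.append((name.strip(), labels, float(value)))
    return samples


# Illustrative snapshot; the real fixture may contain different metrics.
SNAPSHOT = """# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 87
nv_inference_queue_duration_us{model="resnet50_trt_fp16",version="1"} 12500
"""

telemetry = parse_prometheus_text(SNAPSHOT)
```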

Check whether fixed inputs produce identical outputs when served alongside concurrent traffic:

python benchmark.py \
  --mode mock \
  --num-requests 100 \
  --concurrency 8 \
  --batch-invariance-probes 16 \
  --fail-on-batch-variance \
  --prometheus
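
The idea behind an exact fingerprint check can be sketched as hashing the raw bit patterns of each output and comparing the digests from an isolated run against those from a run with concurrent traffic. The helper names and the choice of SHA-256 over packed float64 values are assumptions for illustration; the tool's own fingerprint scheme is not shown here.

```python
import hashlib
import struct


def fingerprint(values):
    """Hash the exact float64 bit patterns so any numeric drift changes the digest."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def find_variant_probes(isolated_outputs, concurrent_outputs):
    """Return the indexes of probes whose fingerprints differ between the two runs."""
    return [
        index
        for index, (alone, noisy) in enumerate(zip(isolated_outputs, concurrent_outputs))
        if fingerprint(alone) != fingerprint(noisy)
    ]


# Two probes: the second output drifts slightly under concurrent load.
isolated = [[0.1, 0.2], [0.3, 0.4]]
under_load = [[0.1, 0.2], [0.3, 0.4000000001]]
variant = find_variant_probes(isolated, under_load)
```

An exact comparison like this flags even last-bit differences, which is why a separate numeric-tolerance policy is a natural follow-up for models where small drift is acceptable.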

Estimate token throughput, accelerator cost, energy, and normalized cost:

python benchmark.py \
  --mode mock \
  --num-requests 500 \
  --concurrency 32 \
  --input-tokens-per-request 1024 \
  --output-tokens-per-request 256 \
  --gpu-count 2 \
  --gpu-hourly-cost-usd 4.50 \
  --power-watts-per-gpu 600 \
  --electricity-cost-usd-per-kwh 0.12 \
  --prometheus

The estimate charges reserved GPU capacity for the full benchmark duration, then normalizes the total cost by successful requests and by tokens. Set --electricity-cost-usd-per-kwh to 0 when the hourly accelerator price already includes facility power.
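
The arithmetic behind such an estimate can be sketched as below. This is one plausible reading of the description, not the tool's exact formula: the function name and output keys are hypothetical, and it assumes every GPU draws its full configured wattage for the whole run. The one-hour duration in the example is also an assumption chosen to keep the numbers easy to check.

```python
def estimate_cost(duration_s, successful_requests,
                  input_tokens_per_request, output_tokens_per_request,
                  gpu_count, gpu_hourly_cost_usd,
                  power_watts_per_gpu, electricity_cost_usd_per_kwh):
    """Estimate token throughput and cost for one benchmark run."""
    hours = duration_s / 3600.0

    # Reserved capacity is billed for the full run, whether or not it was busy.
    gpu_cost_usd = gpu_count * gpu_hourly_cost_usd * hours

    # Assumption: each GPU draws its configured wattage for the whole run.
    energy_kwh = gpu_count * power_watts_per_gpu / 1000.0 * hours
    energy_cost_usd = energy_kwh * electricity_cost_usd_per_kwh

    total_cost_usd = gpu_cost_usd + energy_cost_usd
    tokens_per_request = input_tokens_per_request + output_tokens_per_request
    total_tokens = successful_requests * tokens_per_request

    return {
        "total_cost_usd": total_cost_usd,
        "energy_kwh": energy_kwh,
        "tokens_per_second": total_tokens / duration_s if duration_s > 0 else 0.0,
        # Normalized figures are undefined when nothing succeeded.
        "cost_per_request_usd": (
            total_cost_usd / successful_requests if successful_requests else None
        ),
        "cost_per_million_tokens_usd": (
            total_cost_usd / total_tokens * 1_000_000 if total_tokens else None
        ),
    }


# Flag values from the command above, with an assumed one-hour run.
estimate = estimate_cost(
    duration_s=3600, successful_requests=500,
    input_tokens_per_request=1024, output_tokens_per_request=256,
    gpu_count=2, gpu_hourly_cost_usd=4.50,
    power_watts_per_gpu=600, electricity_cost_usd_per_kwh=0.12,
)
```

Dividing by successful requests rather than attempted requests means failures raise the cost per request, so a run with errors looks more expensive instead of cheaper.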

Compare a candidate run against a saved baseline:

python benchmark.py \
  --mode mock \
  --num-requests 500 \
  --concurrency 32 \
  --baseline sample_results/mock_run.json \
  --max-p95-regression-pct 10 \
  --max-success-rate-drop 0.01 \
  --fail-on-regression \
  --prometheus
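
The two gates can be sketched as a comparison of p95 latency and success rate between the two result files. The `check_regression` helper and its message strings are hypothetical; the sketch assumes both results use the JSON shape shown under Example Output and that the p95 gate is a relative percentage increase over the baseline.

```python
def check_regression(baseline, candidate,
                     max_p95_regression_pct=10.0,
                     max_success_rate_drop=0.01):
    """Return a list of gate violations; an empty list means the candidate passes."""
    base_p95 = baseline["latency_ms"]["p95"]
    cand_p95 = candidate["latency_ms"]["p95"]

    # Relative p95 change, in percent of the baseline value.
    p95_change_pct = (cand_p95 - base_p95) / base_p95 * 100.0
    # Absolute drop in success rate (positive means the candidate is worse).
    success_drop = baseline["success_rate"] - candidate["success_rate"]

    failures = []
    if p95_change_pct > max_p95_regression_pct:
        failures.append(f"p95 regressed by {p95_change_pct:.1f}%")
    if success_drop > max_success_rate_drop:
        failures.append(f"success rate dropped by {success_drop:.4f}")
    return failures


baseline_run = {"success_rate": 1.0, "latency_ms": {"p95": 33.11}}
slow_candidate = {"success_rate": 0.995, "latency_ms": {"p95": 40.0}}
gate_failures = check_regression(baseline_run, slow_candidate)
```

A CI wrapper could exit non-zero whenever the returned list is non-empty, which is the behavior the --fail-on-regression flag name suggests.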

Run against a live Triton endpoint:

pip install -r requirements.txt
python benchmark.py \
  --mode triton \
  --server-url localhost:8000 \
  --model-name resnet50_trt_fp16 \
  --input-name input \
  --input-shape 1,3,224,224 \
  --num-requests 500 \
  --concurrency 32
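
For context, Triton's HTTP endpoint follows the KServe v2 inference protocol, where an inference call is a POST to /v2/models/<model>/infer with a JSON body listing the input tensors. The sketch below only builds such a request from the flags above and does not send it. The `build_infer_request` helper, the FP32 datatype, and the all-zeros payload are assumptions; the tool may use a client library instead of raw HTTP.

```python
import json
import urllib.request
from math import prod


def build_infer_request(server_url, model_name, input_name, input_shape,
                        datatype="FP32"):
    """Build (but do not send) a KServe v2 HTTP inference request."""
    shape = [int(dim) for dim in input_shape.split(",")]
    payload = {
        "inputs": [
            {
                "name": input_name,
                "shape": shape,
                "datatype": datatype,  # Assumption: the model takes FP32 input.
                # Flattened placeholder tensor filled with zeros.
                "data": [0.0] * prod(shape),
            }
        ]
    }
    url = f"http://{server_url}/v2/models/{model_name}/infer"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_infer_request(
    "localhost:8000", "resnet50_trt_fp16", "input", "1,3,224,224"
)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.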

Example Output

{
  "mode": "mock",
  "num_requests": 100,
  "concurrency": 8,
  "successful_requests": 100,
  "failed_requests": 0,
  "success_rate": 1.0,
  "throughput_rps": 305.42,
  "latency_ms": {
    "avg": 21.38,
    "p50": 21.74,
    "p95": 33.11,
    "p99": 34.67
  }
}
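
The latency_ms fields can be derived from the per-request latencies with a percentile rule. The sketch below uses the nearest-rank method, which is an assumption: the interpolation rule benchmark.py uses is not stated here, and the `percentile` and `summarize` helpers are hypothetical.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no latency samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Compute avg, p50, p95, p99, min, and max for a list of latencies."""
    ordered = sorted(latencies_ms)
    return {
        "avg": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "min": ordered[0],
        "max": ordered[-1],
    }


# Synthetic latencies of 1..100 ms, shuffled order does not matter.
latency_summary = summarize([float(ms) for ms in range(100, 0, -1)])
```

Nearest-rank always returns an observed sample, so a reported p99 corresponds to a request that actually happened rather than an interpolated value.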

Test

python -m unittest discover -s tests

Design Notes

See DESIGN.md for the benchmark model, tradeoffs, and production extensions.

Engineering Notes

This project covers benchmarking discipline, model-serving concepts, latency percentiles, failure accounting, Prometheus-compatible artifacts, release regression checks, and a clean path from local mock testing to live inference measurement.

Gaps Worth Closing Next

  • Add warmup windows and separate cold-start metrics.
  • Add payload profiles for chat, embeddings, vision, and long-context workloads.
  • Add distributed load generation for multi-client benchmarking.
  • Add threshold checks for correlated GPU utilization, queue depth, and server-side error counters.
  • Add approximate numeric tolerance policies alongside the exact batch-invariance fingerprint check.
