Triton Inference Benchmark

Benchmark harness for Triton-style model serving. The tool supports a dependency-free mock mode for CI and an optional live HTTP mode for testing a real inference server endpoint.

Why This Exists

Inference infrastructure work is more than "run a model." Strong systems need repeatable benchmarks, clear latency percentiles, failure accounting, configurable concurrency, and results that can be compared over time. This repo is a small, reviewable implementation of that workflow.

Features

Concurrent load generation with configurable request count and worker count.
Retry-aware request execution.
Latency metrics: average, p50, p95, p99, min, max.
Throughput and success-rate reporting.
Dependency-free mock mode for CI and reviewer demos.
Optional Triton HTTP mode for live model-serving benchmarks.
JSON output for trend tracking and regression analysis.
Prometheus text export for dashboard or CI artifact ingestion.
Baseline-versus-candidate comparison with configurable p95 and success-rate gates.
Correlated Triton/DCGM telemetry snapshots for GPU utilization, memory, queue, and server-duration context.
Batch-invariance probes that compare exact output fingerprints in isolation and under concurrent noise traffic.
Token-throughput and cost-to-serve estimates with explicit GPU price, power, and workload assumptions.
Kubernetes Job example for cluster-local benchmark runs.
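
As a rough illustration of the first two features (concurrent load generation and retry-aware execution), the sketch below uses only the standard library. It is not the code in benchmark.py: the names call_with_retries and run_load, the max_retries default, and the choice to count only successful requests in throughput are all assumptions made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retries(send, max_retries=2):
    """Call send() until it succeeds or the retry budget is spent.

    Returns (ok, latency_ms, attempts). Latency spans every attempt.
    """
    start = time.perf_counter()
    attempts = 0
    ok = False
    while attempts <= max_retries:
        attempts += 1
        try:
            send()
            ok = True
            break
        except Exception:
            # Hypothetical policy: retry immediately on any error.
            continue
    latency_ms = (time.perf_counter() - start) * 1000.0
    return ok, latency_ms, attempts


def run_load(send, num_requests, concurrency, max_retries=2):
    """Issue num_requests calls across `concurrency` worker threads."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [
            pool.submit(call_with_retries, send, max_retries)
            for _ in range(num_requests)
        ]
        results = [f.result() for f in futures]
    elapsed = time.perf_counter() - started
    successes = sum(1 for ok, _, _ in results if ok)
    return {
        "successful_requests": successes,
        "failed_requests": num_requests - successes,
        "success_rate": successes / num_requests if num_requests else 0.0,
        "throughput_rps": successes / elapsed if elapsed > 0 else 0.0,
        "total_attempts": sum(a for _, _, a in results),
        "latencies_ms": [lat for ok, lat, _ in results if ok],
    }
```

Tracking total_attempts separately from successful_requests keeps retries visible, so a run that only succeeds through heavy retrying does not look identical to a clean one.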

Engineering Scope

This repo focuses on repeatable load generation, concurrency controls, retry accounting, percentile latency, throughput, and machine-readable output that can feed regression tracking.

Relevant areas:

AI infrastructure: model-serving reliability, latency analysis, failure accounting, and benchmark methodology.
Platform engineering: CLI design, JSON artifacts, CI-friendly mock mode, and extension points for live services.
Performance engineering: percentile metrics, concurrency sweeps, throughput measurement, and reproducible comparison paths.
Infrastructure/SRE: Prometheus-compatible benchmark artifacts, correlated server telemetry, release regression checks, a Kubernetes Job manifest, and operations notes.

Reviewer Fast Path

Start with benchmark.py for benchmark orchestration and CLI behavior.
Review tests/ for metric and execution coverage.
Read DESIGN.md for benchmark tradeoffs and production extensions.
Read docs/OPERATIONS.md for regression triage, SLO-oriented checks, and Prometheus export usage.
Review sample_results/mock_telemetry.prom for the synthetic Triton/DCGM telemetry fixture.
Review deploy/kubernetes/benchmark-job.yaml for the cluster-run shape.
Read docs/PORTFOLIO_REVIEW.md for the technical review guide.

Quick Start

Run a local mock benchmark without GPU dependencies:

python benchmark.py --mode mock --num-requests 100 --concurrency 8

Write JSON plus Prometheus text-format artifacts:

python benchmark.py --mode mock --num-requests 500 --concurrency 32 --prometheus
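
For readers unfamiliar with the Prometheus text format, here is a minimal sketch of how a result dictionary could be rendered into it. The metric prefix inference_benchmark, the stat label, and the function name to_prometheus are hypothetical; the metric names the tool actually emits may differ.

```python
def to_prometheus(result, prefix="inference_benchmark"):
    """Render a benchmark result dict as Prometheus text-format gauges."""
    labels = f'mode="{result["mode"]}"'
    lines = []
    for key in ("success_rate", "throughput_rps",
                "successful_requests", "failed_requests"):
        name = f"{prefix}_{key}"
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name}{{{labels}}} {float(result[key])}")
    # One latency metric, with each statistic distinguished by a label.
    name = f"{prefix}_latency_ms"
    lines.append(f"# TYPE {name} gauge")
    for stat, value in result["latency_ms"].items():
        lines.append(f'{name}{{{labels},stat="{stat}"}} {float(value)}')
    return "\n".join(lines) + "\n"
```

Gauges fit here because each benchmark run is a point-in-time snapshot rather than a monotonically increasing counter.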

Attach a synthetic Triton/DCGM telemetry snapshot to the benchmark result:

python benchmark.py \
  --mode mock \
  --num-requests 500 \
  --concurrency 32 \
  --telemetry-prometheus sample_results/mock_telemetry.prom \
  --prometheus
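
Reading a telemetry snapshot amounts to parsing Prometheus text-format samples. The sketch below is a deliberately small parser and an assumption about the approach, not the repo's implementation: it ignores label contents, skips comment lines, and groups sample values by metric name. The name parse_prometheus_text is hypothetical.

```python
import re

# metric_name, optional {labels}, value, optional integer timestamp
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def parse_prometheus_text(text):
    """Return {metric_name: [values...]} for every sample line in text."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, HELP, or TYPE comment
        match = _SAMPLE.match(line)
        if match is None:
            continue  # not a well-formed sample; skip it
        name, _labels, value = match.groups()
        samples.setdefault(name, []).append(float(value))
    return samples
```

Usage with made-up values (the contents of sample_results/mock_telemetry.prom are not reproduced here):

```python
snapshot = parse_prometheus_text(
    'nv_gpu_utilization{gpu_uuid="GPU-0"} 0.72\n'
    'nv_gpu_utilization{gpu_uuid="GPU-1"} 0.68\n'
)
mean_util = sum(snapshot["nv_gpu_utilization"]) / len(snapshot["nv_gpu_utilization"])
```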

Check whether fixed inputs produce identical outputs when served alongside concurrent traffic:

python benchmark.py \
  --mode mock \
  --num-requests 100 \
  --concurrency 8 \
  --batch-invariance-probes 16 \
  --fail-on-batch-variance \
  --prometheus
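
The idea behind an exact fingerprint comparison can be sketched as follows. This assumes outputs are JSON-serializable and uses SHA-256 over a canonical encoding; the hashing scheme and the names fingerprint and find_variant_probes are assumptions for illustration, and the tool's own fingerprint may be computed differently.

```python
import hashlib
import json


def fingerprint(output):
    """Hash a canonical JSON encoding so equal outputs hash equally."""
    canonical = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_variant_probes(isolated_outputs, noisy_outputs):
    """Return indices of probes whose output changed under noise traffic.

    isolated_outputs[i] and noisy_outputs[i] must come from the same
    fixed input, served alone and alongside concurrent requests.
    """
    return [
        i
        for i, (alone, noisy) in enumerate(zip(isolated_outputs, noisy_outputs))
        if fingerprint(alone) != fingerprint(noisy)
    ]
```

Because the comparison is exact, even a last-bit floating-point difference caused by a different batch composition counts as variance.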

Estimate token throughput, accelerator cost, energy, and normalized cost:

python benchmark.py \
  --mode mock \
  --num-requests 500 \
  --concurrency 32 \
  --input-tokens-per-request 1024 \
  --output-tokens-per-request 256 \
  --gpu-count 2 \
  --gpu-hourly-cost-usd 4.50 \
  --power-watts-per-gpu 600 \
  --electricity-cost-usd-per-kwh 0.12 \
  --prometheus

The estimate charges reserved GPU capacity for the full benchmark duration and normalizes cost by successful requests and tokens. Set --electricity-cost-usd-per-kwh to 0 when the hourly accelerator price already includes facility power.
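
The arithmetic behind such an estimate is simple enough to write out. The sketch below follows the description above (reserved capacity billed for the whole run, cost normalized by successful requests and tokens); the function name estimate_cost, the output keys, and the per-million-token normalization are assumptions, not the tool's actual output schema.

```python
def estimate_cost(duration_s, successful_requests,
                  input_tokens_per_request, output_tokens_per_request,
                  gpu_count, gpu_hourly_cost_usd,
                  power_watts_per_gpu=0.0, electricity_cost_usd_per_kwh=0.0):
    """Estimate token throughput, energy, and cost for one benchmark run."""
    hours = duration_s / 3600.0
    # Reserved GPUs are billed for the full duration, busy or idle.
    gpu_cost = gpu_count * gpu_hourly_cost_usd * hours
    energy_kwh = gpu_count * power_watts_per_gpu / 1000.0 * hours
    total_cost = gpu_cost + energy_kwh * electricity_cost_usd_per_kwh

    output_tokens = successful_requests * output_tokens_per_request
    total_tokens = successful_requests * (
        input_tokens_per_request + output_tokens_per_request
    )
    return {
        "output_tokens_per_s": output_tokens / duration_s if duration_s > 0 else 0.0,
        "energy_kwh": energy_kwh,
        "total_cost_usd": total_cost,
        "cost_per_request_usd": (
            total_cost / successful_requests if successful_requests else 0.0
        ),
        "cost_per_million_tokens_usd": (
            total_cost / total_tokens * 1_000_000 if total_tokens else 0.0
        ),
    }
```

Dividing by successful requests rather than attempted requests means failures raise the per-request cost, which is usually the behavior wanted when comparing configurations.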

Compare a candidate run against a saved baseline:

python benchmark.py \
  --mode mock \
  --num-requests 500 \
  --concurrency 32 \
  --baseline sample_results/mock_run.json \
  --max-p95-regression-pct 10 \
  --max-success-rate-drop 0.01 \
  --fail-on-regression \
  --prometheus
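
A gate of this kind reduces to two comparisons. The sketch below assumes both runs use the JSON shape shown under Example Output and that the p95 gate is a relative percentage while the success-rate gate is an absolute drop; check_regression and its messages are hypothetical names for illustration.

```python
def check_regression(baseline, candidate,
                     max_p95_regression_pct=10.0,
                     max_success_rate_drop=0.01):
    """Return a list of gate violations; an empty list means the run passes."""
    failures = []

    base_p95 = baseline["latency_ms"]["p95"]
    cand_p95 = candidate["latency_ms"]["p95"]
    if base_p95 > 0:
        change_pct = (cand_p95 - base_p95) / base_p95 * 100.0
        if change_pct > max_p95_regression_pct:
            failures.append(
                f"p95 regressed {change_pct:.1f}% "
                f"(limit {max_p95_regression_pct}%)"
            )

    drop = baseline["success_rate"] - candidate["success_rate"]
    if drop > max_success_rate_drop:
        failures.append(
            f"success rate dropped {drop:.4f} "
            f"(limit {max_success_rate_drop})"
        )
    return failures
```

A CI wrapper could then exit non-zero when the list is non-empty, which is the behavior a flag like --fail-on-regression suggests.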

Run against a live Triton endpoint:

pip install -r requirements.txt
python benchmark.py \
  --mode triton \
  --server-url localhost:8000 \
  --model-name resnet50_trt_fp16 \
  --input-name input \
  --input-shape 1,3,224,224 \
  --num-requests 500 \
  --concurrency 32
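
Triton's HTTP endpoint accepts the KServe v2 inference protocol, so the flags above map fairly directly onto a request. The sketch below builds and sends such a request with only the standard library. It assumes an FP32 input filled with zeros and a plain http:// scheme; the function names are hypothetical, and the repo's live mode may use a client library from requirements.txt instead.

```python
import json
import urllib.request
from math import prod


def build_infer_request(server_url, model_name, input_name, input_shape,
                        datatype="FP32"):
    """Build the URL and JSON body for a v2 /infer call.

    input_shape is the comma-separated string passed on the command line.
    The zero-filled tensor and FP32 datatype are assumptions.
    """
    shape = [int(dim) for dim in input_shape.split(",")]
    body = {
        "inputs": [{
            "name": input_name,
            "shape": shape,
            "datatype": datatype,
            "data": [0.0] * prod(shape),  # flattened, row-major
        }]
    }
    url = f"http://{server_url}/v2/models/{model_name}/infer"
    return url, json.dumps(body).encode("utf-8")


def send_infer_request(url, payload, timeout_s=10.0):
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())
```

Building the payload once and reusing it across requests keeps serialization cost out of the measured latency.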

Example Output

{
  "mode": "mock",
  "num_requests": 100,
  "concurrency": 8,
  "successful_requests": 100,
  "failed_requests": 0,
  "success_rate": 1.0,
  "throughput_rps": 305.42,
  "latency_ms": {
    "avg": 21.38,
    "p50": 21.74,
    "p95": 33.11,
    "p99": 34.67
  }
}
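
For reference, the latency_ms fields can be derived from raw per-request samples roughly as shown below. This sketch uses the nearest-rank percentile definition, which is an assumption; benchmark.py may interpolate between samples instead, and summarize_latencies is a hypothetical name.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no latency samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms):
    """Compute the summary statistics listed under Features."""
    ordered = sorted(latencies_ms)
    return {
        "avg": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "min": ordered[0],
        "max": ordered[-1],
    }
```

With only 100 requests, p99 is determined by the second-slowest sample, so tail percentiles from short runs should be read with caution.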

Test

python -m unittest discover -s tests

Design Notes

See DESIGN.md for the benchmark model, tradeoffs, and production extensions.

Engineering Notes

This project covers benchmarking discipline, model-serving concepts, latency percentiles, failure accounting, Prometheus-compatible artifacts, release regression checks, and a clean path from local mock testing to live inference measurement.

Gaps Worth Closing Next

Add warmup windows and separate cold-start metrics.
Add payload profiles for chat, embeddings, vision, and long-context workloads.
Add distributed load generation for multi-client benchmarking.
Add threshold checks for correlated GPU utilization, queue depth, and server-side error counters.
Add approximate numeric tolerance policies alongside the exact batch-invariance fingerprint check.
