Benchmark harness for Triton-style model serving. The tool supports a dependency-free mock mode for CI and an optional live HTTP mode for testing a real inference server endpoint.
Inference infrastructure work is not just "run a model." Strong systems need repeatable benchmarks, clear latency percentiles, failure accounting, configurable concurrency, and results that can be compared over time. This repo is a small but reviewable artifact around that workflow.
- Concurrent load generation with configurable request count and worker count.
- Retry-aware request execution.
- Latency metrics: average, p50, p95, p99, min, max.
- Throughput and success-rate reporting.
- Dependency-free mock mode for CI and reviewer demos.
- Optional Triton HTTP mode for live model-serving benchmarks.
- JSON output for trend tracking and regression analysis.
- Prometheus text export for dashboard or CI artifact ingestion.
- Baseline-versus-candidate comparison with configurable p95 and success-rate gates.
- Correlated Triton/DCGM telemetry snapshots for GPU utilization, memory, queue, and server-duration context.
- Batch-invariance probes that compare exact output fingerprints in isolation and under concurrent noise traffic.
- Token-throughput and cost-to-serve estimates with explicit GPU price, power, and workload assumptions.
- Kubernetes Job example for cluster-local benchmark runs.
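
To illustrate the batch-invariance probe in the list above, here is a minimal sketch of an exact-fingerprint check. It hashes the raw bytes of each output and compares probes taken under concurrent noise traffic against one isolated reference. The names `fingerprint`, `mock_model`, and `probe_batch_invariance` are hypothetical and are not taken from `benchmark.py`; the real probe may fingerprint outputs differently.

```python
import hashlib
import struct
import threading
from concurrent.futures import ThreadPoolExecutor


def fingerprint(output: list[float]) -> str:
    """Hash the exact bit pattern of the outputs; any bit-level drift changes the digest."""
    raw = struct.pack(f"<{len(output)}d", *output)
    return hashlib.sha256(raw).hexdigest()


def mock_model(x: list[float]) -> list[float]:
    """Hypothetical deterministic model standing in for a real inference call."""
    return [v * 2.0 + 1.0 for v in x]


def probe_batch_invariance(infer, probe_input, num_probes: int, noise_workers: int) -> dict:
    # Reference fingerprint taken with no other traffic in flight.
    baseline = fingerprint(infer(probe_input))

    stop = threading.Event()

    def noise() -> None:
        # Unrelated requests that keep the server busy while probes run.
        while not stop.is_set():
            infer([0.5, 1.5, 2.5])

    with ThreadPoolExecutor(max_workers=max(1, noise_workers)) as pool:
        noise_futures = [pool.submit(noise) for _ in range(noise_workers)]
        try:
            observed = [fingerprint(infer(probe_input)) for _ in range(num_probes)]
        finally:
            stop.set()  # always release the noise threads
        for future in noise_futures:
            future.result()

    mismatches = sum(fp != baseline for fp in observed)
    return {"probes": num_probes, "mismatches": mismatches, "invariant": mismatches == 0}


if __name__ == "__main__":
    print(probe_batch_invariance(mock_model, [1.0, 2.0, 3.0], num_probes=16, noise_workers=4))
```

An exact hash is deliberately strict: a single differing bit counts as a mismatch, which is the point when the question is whether batching changes results at all.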
This repo focuses on repeatable load generation, concurrency controls, retry accounting, percentile latency, throughput, and machine-readable output that can feed regression tracking.
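
The pieces named above (concurrent load generation, retry accounting, percentile latency, throughput) fit together roughly as in the following sketch. It is a simplified illustration, not the code in `benchmark.py`: `mock_infer`, `timed_request`, `percentile`, and `run_benchmark` are hypothetical names, the retry budget of 2 is an assumption, and the linear-interpolation percentile is one common definition among several.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def mock_infer() -> None:
    """Hypothetical stand-in for one inference call; fails about 5% of the time."""
    time.sleep(random.uniform(0.001, 0.003))
    if random.random() < 0.05:
        raise RuntimeError("simulated transient failure")


def timed_request(call, max_retries: int = 2):
    """Run one request with retries; return (succeeded, latency_ms, attempts).

    The latency covers all attempts, so retried requests look slower.
    """
    start = time.perf_counter()
    attempts = 0
    succeeded = False
    while attempts <= max_retries and not succeeded:
        attempts += 1
        try:
            call()
            succeeded = True
        except Exception:
            pass  # fall through and retry until the budget is spent
    latency_ms = (time.perf_counter() - start) * 1000.0
    return succeeded, latency_ms, attempts


def percentile(sorted_values, pct: float) -> float:
    """Linear-interpolation percentile over an already sorted list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * (rank - lower)


def run_benchmark(call, num_requests: int, concurrency: int) -> dict:
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(lambda _: timed_request(call), range(num_requests)))
    wall_seconds = time.perf_counter() - wall_start

    # Only successful requests contribute to the latency distribution.
    latencies = sorted(ms for ok, ms, _ in outcomes if ok)
    successes = len(latencies)
    latency_ms = None
    if latencies:
        latency_ms = {
            "avg": statistics.fmean(latencies),
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
            "min": latencies[0],
            "max": latencies[-1],
        }
    return {
        "num_requests": num_requests,
        "concurrency": concurrency,
        "successful_requests": successes,
        "failed_requests": num_requests - successes,
        "success_rate": successes / num_requests,
        "throughput_rps": successes / wall_seconds,
        "retries": sum(attempts - 1 for _, _, attempts in outcomes),
        "latency_ms": latency_ms,
    }


if __name__ == "__main__":
    print(run_benchmark(mock_infer, num_requests=100, concurrency=8))
```

Throughput here counts only successful requests over wall-clock time, so a run with many failures does not report inflated requests per second.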
Relevant areas:
- AI infrastructure: model-serving reliability, latency analysis, failure accounting, and benchmark methodology.
- Platform engineering: CLI design, JSON artifacts, CI-friendly mock mode, and extension points for live services.
- Performance engineering: percentile metrics, concurrency sweeps, throughput measurement, and reproducible comparison paths.
- Infrastructure/SRE: Prometheus-compatible benchmark artifacts, correlated server telemetry, release regression checks, Kubernetes job posture, and operations notes.
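
As a rough picture of what a Prometheus-compatible benchmark artifact can look like, the sketch below renders a result dictionary in the Prometheus text exposition format (`# HELP`, `# TYPE`, then labeled samples). The metric names (`benchmark_success_rate`, `benchmark_throughput_rps`, `benchmark_latency_ms`) and the function `to_prometheus_text` are assumptions made for this example; the names emitted by `benchmark.py --prometheus` may differ.

```python
def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus_text(result: dict) -> str:
    # Labels shared by every sample so runs can be told apart on a dashboard.
    base = {"mode": result["mode"], "concurrency": str(result["concurrency"])}

    # (metric name, help text, [(extra labels, value), ...]); names are hypothetical.
    metrics = [
        ("benchmark_success_rate", "Fraction of requests that succeeded.",
         [({}, result["success_rate"])]),
        ("benchmark_throughput_rps", "Successful requests per second.",
         [({}, result["throughput_rps"])]),
        ("benchmark_latency_ms", "Request latency in milliseconds by statistic.",
         [({"stat": stat}, value) for stat, value in result["latency_ms"].items()]),
    ]

    lines = []
    for name, help_text, samples in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for extra, value in samples:
            merged = {**base, **extra}
            labels = ",".join(f'{k}="{escape_label(str(v))}"' for k, v in merged.items())
            lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {
        "mode": "mock",
        "concurrency": 8,
        "success_rate": 1.0,
        "throughput_rps": 305.42,
        "latency_ms": {"avg": 21.38, "p50": 21.74, "p95": 33.11, "p99": 34.67},
    }
    print(to_prometheus_text(sample), end="")
```

Gauges are used throughout because each artifact is a point-in-time summary of one finished run rather than a live, monotonically increasing counter.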
- Start with `benchmark.py` for benchmark orchestration and CLI behavior.
- Review `tests/` for metric and execution coverage.
- Read `DESIGN.md` for benchmark tradeoffs and production extensions.
- Read `docs/OPERATIONS.md` for regression triage, SLO-oriented checks, and Prometheus export usage.
- Review `sample_results/mock_telemetry.prom` for the synthetic Triton/DCGM telemetry fixture.
- Review `deploy/kubernetes/benchmark-job.yaml` for the cluster-run shape.
- Read `docs/PORTFOLIO_REVIEW.md` for the technical review guide.
Run a local mock benchmark without GPU dependencies:

    python benchmark.py --mode mock --num-requests 100 --concurrency 8

Write JSON plus Prometheus text-format artifacts:

    python benchmark.py --mode mock --num-requests 500 --concurrency 32 --prometheus

Attach a synthetic Triton/DCGM telemetry snapshot to the benchmark result:

    python benchmark.py \
      --mode mock \
      --num-requests 500 \
      --concurrency 32 \
      --telemetry-prometheus sample_results/mock_telemetry.prom \
      --prometheus

Check whether fixed inputs produce identical outputs when served alongside concurrent traffic:

    python benchmark.py \
      --mode mock \
      --num-requests 100 \
      --concurrency 8 \
      --batch-invariance-probes 16 \
      --fail-on-batch-variance \
      --prometheus

Estimate token throughput, accelerator cost, energy, and normalized cost:

    python benchmark.py \
      --mode mock \
      --num-requests 500 \
      --concurrency 32 \
      --input-tokens-per-request 1024 \
      --output-tokens-per-request 256 \
      --gpu-count 2 \
      --gpu-hourly-cost-usd 4.50 \
      --power-watts-per-gpu 600 \
      --electricity-cost-usd-per-kwh 0.12 \
      --prometheus

The estimate charges reserved GPU capacity for the full benchmark duration and normalizes cost by successful requests and tokens. Set electricity to zero when the hourly accelerator price already includes facility power.
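
The arithmetic behind such an estimate can be sketched as follows. This is an illustration under stated assumptions rather than the formula used in `benchmark.py`: the function name `estimate_cost` and its output field names are hypothetical, and each GPU is assumed to draw the stated wattage for the entire run.

```python
def estimate_cost(
    duration_seconds: float,
    successful_requests: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    gpu_count: int,
    gpu_hourly_cost_usd: float,
    power_watts_per_gpu: float = 0.0,
    electricity_cost_usd_per_kwh: float = 0.0,
) -> dict:
    hours = duration_seconds / 3600.0

    # Reserved capacity is billed for the whole run, whether or not it was busy.
    gpu_cost_usd = gpu_count * gpu_hourly_cost_usd * hours

    # Assumption: constant draw at the stated wattage for the full duration.
    energy_kwh = gpu_count * power_watts_per_gpu / 1000.0 * hours
    energy_cost_usd = energy_kwh * electricity_cost_usd_per_kwh

    total_cost_usd = gpu_cost_usd + energy_cost_usd
    total_tokens = successful_requests * (input_tokens_per_request + output_tokens_per_request)

    # Normalize only by successful work; failed requests still cost money.
    cost_per_request = total_cost_usd / successful_requests if successful_requests else None
    cost_per_million_tokens = total_cost_usd / total_tokens * 1_000_000 if total_tokens else None

    return {
        "tokens_per_second": total_tokens / duration_seconds if duration_seconds > 0 else None,
        "gpu_cost_usd": gpu_cost_usd,
        "energy_kwh": energy_kwh,
        "energy_cost_usd": energy_cost_usd,
        "total_cost_usd": total_cost_usd,
        "cost_per_request_usd": cost_per_request,
        "cost_per_million_tokens_usd": cost_per_million_tokens,
    }


if __name__ == "__main__":
    # The 10-second duration is a made-up value; the other inputs mirror the flags above.
    print(estimate_cost(
        duration_seconds=10.0,
        successful_requests=500,
        input_tokens_per_request=1024,
        output_tokens_per_request=256,
        gpu_count=2,
        gpu_hourly_cost_usd=4.50,
        power_watts_per_gpu=600,
        electricity_cost_usd_per_kwh=0.12,
    ))
```

Leaving both power arguments at zero reduces the estimate to the hourly GPU price alone, which matches the case where that price already includes facility power.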
Compare a candidate run against a saved baseline:

    python benchmark.py \
      --mode mock \
      --num-requests 500 \
      --concurrency 32 \
      --baseline sample_results/mock_run.json \
      --max-p95-regression-pct 10 \
      --max-success-rate-drop 0.01 \
      --fail-on-regression \
      --prometheus

Run against a live Triton endpoint:

    pip install -r requirements.txt
    python benchmark.py \
      --mode triton \
      --server-url localhost:8000 \
      --model-name resnet50_trt_fp16 \
      --input-name input \
      --input-shape 1,3,224,224 \
      --num-requests 500 \
      --concurrency 32

Example JSON output from a mock run:

    {
      "mode": "mock",
      "num_requests": 100,
      "concurrency": 8,
      "successful_requests": 100,
      "failed_requests": 0,
      "success_rate": 1.0,
      "throughput_rps": 305.42,
      "latency_ms": {
        "avg": 21.38,
        "p50": 21.74,
        "p95": 33.11,
        "p99": 34.67
      }
    }

Run the tests:

    python -m unittest discover -s tests

See `DESIGN.md` for the benchmark model, tradeoffs, and production extensions.
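
For readers who want to see the baseline-versus-candidate gate in concrete terms, the following sketch applies it to two result dictionaries of the shape shown above. The function name `compare_runs` and its return fields are hypothetical. The threshold semantics are also assumptions: the p95 gate is treated as a relative increase in percent and the success-rate gate as an absolute drop, which may not match how `benchmark.py` interprets `--max-p95-regression-pct` and `--max-success-rate-drop`.

```python
def compare_runs(
    baseline: dict,
    candidate: dict,
    max_p95_regression_pct: float = 10.0,
    max_success_rate_drop: float = 0.01,
) -> dict:
    base_p95 = baseline["latency_ms"]["p95"]
    cand_p95 = candidate["latency_ms"]["p95"]

    # Positive values mean the candidate is slower than the baseline.
    p95_change_pct = (cand_p95 - base_p95) / base_p95 * 100.0

    # Positive values mean the candidate succeeds less often.
    success_rate_drop = baseline["success_rate"] - candidate["success_rate"]

    failures = []
    if p95_change_pct > max_p95_regression_pct:
        failures.append(
            f"p95 regressed {p95_change_pct:.1f}% (limit {max_p95_regression_pct:.1f}%)"
        )
    if success_rate_drop > max_success_rate_drop:
        failures.append(
            f"success rate dropped {success_rate_drop:.4f} (limit {max_success_rate_drop:.4f})"
        )

    return {
        "p95_change_pct": p95_change_pct,
        "success_rate_drop": success_rate_drop,
        "failures": failures,
        "passed": not failures,
    }


if __name__ == "__main__":
    baseline = {"success_rate": 1.0, "latency_ms": {"p95": 33.11}}
    candidate = {"success_rate": 0.98, "latency_ms": {"p95": 40.0}}  # made-up candidate
    report = compare_runs(baseline, candidate)
    print(report)
    # A CI wrapper could exit non-zero when the gate fails.
    raise SystemExit(0 if report["passed"] else 1)
```

Reporting every violated gate, rather than stopping at the first, keeps a failed CI run informative when latency and reliability regress together.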
This project covers benchmarking discipline, model-serving concepts, latency percentiles, failure accounting, Prometheus-compatible artifacts, release regression checks, and a clean path from local mock testing to live inference measurement.
- Add warmup windows and separate cold-start metrics.
- Add payload profiles for chat, embeddings, vision, and long-context workloads.
- Add distributed load generation for multi-client benchmarking.
- Add threshold checks for correlated GPU utilization, queue depth, and server-side error counters.
- Add approximate numeric tolerance policies alongside the exact batch-invariance fingerprint check.