A Python tool for parsing GuideLLM benchmark results and extracting key performance metrics. This parser processes benchmark JSON files and outputs structured data including aggregate statistics and per-request timeseries data, with optional OpenSearch indexing support.
- Comprehensive Metrics Extraction: Parses GuideLLM benchmark results and extracts key performance indicators
- Timeseries Data: Captures individual request timing data for detailed analysis
- OpenSearch Integration: Optionally indexes results directly to OpenSearch for visualization and analysis
- Flexible Output: Exports results to a JSON file or stdout
- Container Support: Includes a Containerfile for easy deployment
The parser extracts the following aggregate metrics from benchmark runs:
- uuid: Unique identifier for the benchmark run
- job_name: Name of the job/benchmark
- sample: Sample number for the benchmark
- guidellm_version: Version of GuideLLM used to generate the benchmark results
- timestamp: ISO 8601 timestamp of benchmark start
- backend_model: Model being benchmarked
- total_requests: Total number of requests made
- successful_requests: Number of successful requests
- errored_requests: Number of failed requests
- incomplete_requests: Number of incomplete requests
- Time to First Token (TTFT)
  - ttft_mean_ms: Mean TTFT in milliseconds
  - ttft_p99_ms: 99th percentile TTFT in milliseconds
- Inter Token Latency (ITL)
  - itl_mean_ms: Mean ITL in milliseconds
  - itl_p99_ms: 99th percentile ITL in milliseconds
- Request Latency
  - request_latency_mean_seconds: Mean request latency in seconds
  - request_latency_p95_seconds: 95th percentile request latency in seconds
  - request_latency_p99_seconds: 99th percentile request latency in seconds
- throughput_mean_rps: Mean requests per second
- throughput_p95_rps: 95th percentile requests per second
- throughput_p99_rps: 99th percentile requests per second
- prompt_tokens: Number of prompt tokens
- output_tokens: Number of output tokens
- tokens_per_second_mean: Mean total tokens per second
- tokens_per_second_p95: 95th percentile total tokens per second
- tokens_per_second_p99: 99th percentile total tokens per second
- output_tokens_per_second_mean: Mean output tokens per second
- output_tokens_per_second_p95: 95th percentile output tokens per second
- output_tokens_per_second_p99: 99th percentile output tokens per second
- time_per_output_token_mean_ms: Mean time per output token in milliseconds
- time_per_output_token_p95_ms: 95th percentile time per output token in milliseconds
- time_per_output_token_p99_ms: 99th percentile time per output token in milliseconds
- strategy: Benchmark strategy type (e.g., "constant")
- rate: Request rate (requests per second)
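As a small illustration of how these aggregate fields can be combined, the sketch below derives a success rate from the request counters. The helper name `success_rate` is hypothetical and not part of the parser; the counter values are copied from the example output in this README.

```python
def success_rate(summary: dict) -> float:
    """Return the fraction of requests that succeeded (0.0 if none were made)."""
    total = summary.get("total_requests", 0)
    if not total:
        return 0.0
    return summary.get("successful_requests", 0) / total


# Counter values copied from the example output in this README.
summary = {
    "total_requests": 609,
    "successful_requests": 586,
    "errored_requests": 0,
    "incomplete_requests": 23,
}

print(f"success rate: {success_rate(summary):.1%}")
```

Incomplete requests count against the rate here because they are included in `total_requests`; whether that is the right definition depends on the analysis.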
For each individual request (successful or errored), the parser extracts:
- timestamp: ISO 8601 timestamp when the request started
- errored: Boolean indicating if the request failed
- completed: Boolean indicating if the request completed
- request_latency_seconds: Total request latency in seconds
- tokens_per_second: Total tokens generated per second for this request
- output_tokens_per_second: Output tokens generated per second for this request
- tpot_ms: Time per output token in milliseconds
- itl_ms: Inter-token latency in milliseconds
- ttft_ms: Time to first token in milliseconds
- uuid: Benchmark UUID (for correlation)
- job_name: Job name (for correlation)
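The per-request fields make it possible to recompute statistics over any subset of requests. The following is a minimal sketch, assuming the entries are already loaded as Python dictionaries; `mean_ttft_ms` is a hypothetical helper, and the two entries are abbreviated from the example output in this README.

```python
import statistics


def mean_ttft_ms(entries: list) -> float:
    """Mean ttft_ms over completed, non-errored per-request entries."""
    values = [
        entry["ttft_ms"]
        for entry in entries
        if entry.get("completed")
        and not entry.get("errored")
        and entry.get("ttft_ms") is not None
    ]
    if not values:
        raise ValueError("no completed requests with a ttft_ms value")
    return statistics.mean(values)


# Two entries abbreviated from the example output in this README.
entries = [
    {"errored": False, "completed": True, "ttft_ms": 1014.298677444458},
    {"errored": False, "completed": True, "ttft_ms": 1014.9099826812744},
]

print(f"mean TTFT: {mean_ttft_ms(entries):.2f} ms")
```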
Build the container image:
podman build -t guidellm-parser -f Containerfile .
# or with Docker
docker build -t guidellm-parser -f Containerfile .

Parse a benchmark file and output to stdout:

python3 guidellm_parser.py --results benchmarks.json --uuid my-benchmark-001 --job-name "interactive-chat"

Export results to a JSON file:

python3 guidellm_parser.py --results benchmarks.json --uuid my-benchmark-001 --job-name "interactive-chat" --output results.json

Parse and index results directly to OpenSearch:

python3 guidellm_parser.py \
--results benchmarks.json \
--uuid my-benchmark-001 \
--job-name "interactive-chat" \
--es-server http://localhost:9200 \
--es-index benchmark-results

Run the parser in a container:

podman run --rm \
-v $(pwd):/data:Z \
guidellm-parser \
/usr/bin/guidellm_parser.py \
--results /data/benchmarks.json \
--uuid my-benchmark-001 \
--job-name "interactive-chat" \
--output /data/parsed_results.json

| Argument | Required | Default | Description |
|---|---|---|---|
| --results | No | benchmarks.json | Path to the GuideLLM results file |
| --uuid | No | "" | UUID for the benchmark run |
| --job-name, -j | No | "" | Name of the benchmark job |
| --output, -o | No | stdout | Output file path |
| --es-server | No | - | OpenSearch endpoint URL (e.g., http://localhost:9200) |
| --es-index | No | - | OpenSearch index name |

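For readers who want to wrap or mirror this interface, the arguments in the table could be declared with `argparse` roughly as follows. This is a sketch based only on the table, not the parser's actual source; the function name `build_arg_parser` is hypothetical, and representing the "stdout" and "-" defaults as `None` is an assumption.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    """Declare the command-line options listed in the arguments table."""
    parser = argparse.ArgumentParser(description="Parse GuideLLM benchmark results")
    parser.add_argument("--results", default="benchmarks.json",
                        help="Path to the GuideLLM results file")
    parser.add_argument("--uuid", default="",
                        help="UUID for the benchmark run")
    parser.add_argument("--job-name", "-j", default="",
                        help="Name of the benchmark job")
    # None stands in for "write to stdout" (assumption).
    parser.add_argument("--output", "-o", default=None,
                        help="Output file path")
    # None stands in for "not configured" (assumption).
    parser.add_argument("--es-server", default=None,
                        help="OpenSearch endpoint URL")
    parser.add_argument("--es-index", default=None,
                        help="OpenSearch index name")
    return parser


args = build_arg_parser().parse_args(
    ["--uuid", "my-benchmark-001", "-j", "interactive-chat"]
)
print(args.results, args.uuid, args.job_name)
```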
The parser outputs a JSON array where:
- The first element is the aggregate summary metrics
- Subsequent elements are timeseries entries for individual requests
[
{
"uuid": "c054eaf6-7b10-4dd5-a462-fbc010f7b09d",
"job_name": "interactive-chat",
"sample": 0,
"guidellm_version": "0.5.2",
"timestamp": "2025-09-23T23:59:06.125779",
"strategy": "constant",
"rate": 10.0,
"total_requests": 609,
"successful_requests": 586,
"errored_requests": 0,
"incomplete_requests": 23,
"prompt_tokens": "128",
"output_tokens": "128",
"backend_model": "Qwen/Qwen3-0.6B",
"ttft_mean_ms": 1006.2143587008272,
"ttft_p99_ms": 1019.4225311279297,
"itl_mean_ms": 10.423929083078537,
"itl_p99_ms": 10.578223100797398,
"throughput_mean_rps": 9.777764697593275,
"throughput_p95_rps": 12.67751159149574,
"throughput_p99_rps": 14.158848470118016,
"request_latency_mean_seconds": 2.3300955913986363,
"request_latency_p95_seconds": 2.3453574180603027,
"request_latency_p99_seconds": 2.3493130207061768,
"tokens_per_second_mean": 2426.93797445331,
"tokens_per_second_p95": 3785.4729241877258,
"tokens_per_second_p99": 17772.474576271186,
"output_tokens_per_second_mean": 1251.5371956866534,
"output_tokens_per_second_p95": 3480.7502074688796,
"output_tokens_per_second_p99": 8272.788954635109,
"time_per_output_token_mean_ms": 10.342492137116988,
"time_per_output_token_p95_ms": 10.460831224918365,
"time_per_output_token_p99_ms": 10.495580732822418
},
{
"timestamp": "2025-09-23T23:59:06.126977",
"errored": false,
"completed": true,
"request_latency_seconds": 2.3312907218933105,
"tokens_per_second": 108.09462656606951,
"output_tokens_per_second": 54.90520714467022,
"tpot_ms": 10.28873398900032,
"itl_ms": 10.369747642457016,
"ttft_ms": 1014.298677444458,
"uuid": "c054eaf6-7b10-4dd5-a462-fbc010f7b09d",
"job_name": "interactive-chat"
},
{
"timestamp": "2025-09-23T23:59:06.126454",
"errored": false,
"completed": true,
"request_latency_seconds": 2.334320545196533,
"tokens_per_second": 102.81364335067946,
"output_tokens_per_second": 54.83394312036238,
"tpot_ms": 10.307451710104942,
"itl_ms": 10.388612747192383,
"ttft_ms": 1014.9099826812744,
"uuid": "c054eaf6-7b10-4dd5-a462-fbc010f7b09d",
"job_name": "interactive-chat"
}
]

This parser is designed to work with benchmark results from GuideLLM, a performance evaluation tool for large language models. GuideLLM generates comprehensive benchmark results in JSON format, which this parser processes into a structured format suitable for analysis and visualization.
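Because the parser's output is a single JSON array with the summary first and the per-request entries after it, downstream code usually needs to separate the two. A minimal sketch, assuming the output has been written to a file or captured as a string; `split_output` is a hypothetical helper, and the inline JSON is an abbreviated stand-in for a real output file.

```python
import json


def split_output(documents: list):
    """Split parser output into (summary, timeseries entries)."""
    if not documents:
        raise ValueError("parser output is empty")
    return documents[0], documents[1:]


# Abbreviated stand-in for the contents of a file such as results.json.
raw = """[
  {"uuid": "my-benchmark-001", "job_name": "interactive-chat", "total_requests": 2},
  {"uuid": "my-benchmark-001", "job_name": "interactive-chat", "ttft_ms": 1014.298677444458},
  {"uuid": "my-benchmark-001", "job_name": "interactive-chat", "ttft_ms": 1014.9099826812744}
]"""

summary, timeseries = split_output(json.loads(raw))
print(summary["job_name"], len(timeseries))
```

With a real file, `json.loads(raw)` would be replaced by `json.load()` on the opened output file.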
- Run GuideLLM benchmark:
  guidellm --model your-model --backend openai --rate 10 --max-duration 60s --output-format json > benchmarks.json
- Parse the results:
  python3 guidellm_parser.py --results benchmarks.json --uuid benchmark-001 --job-name "Model Test" --output parsed_results.json
- (Optional) Index to OpenSearch for visualization:
  python3 guidellm_parser.py --results benchmarks.json --uuid benchmark-001 --es-server http://opensearch:9200 --es-index llm-benchmarks
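If parsed results were saved to a file first, they could also be indexed afterwards with `opensearch-py`. The sketch below shows one possible approach using the library's bulk helper; it is not necessarily how the parser indexes documents internally, and the function names `build_bulk_actions` and `index_documents` are hypothetical.

```python
def build_bulk_actions(documents: list, index_name: str) -> list:
    """Wrap each parsed document in the action format used by helpers.bulk."""
    return [{"_index": index_name, "_source": doc} for doc in documents]


def index_documents(documents: list, server: str, index_name: str) -> int:
    """Send documents to OpenSearch and return how many were indexed."""
    # Imported lazily so the rest of the sketch runs without opensearch-py.
    from opensearchpy import OpenSearch, helpers

    client = OpenSearch(hosts=[server])
    indexed, _errors = helpers.bulk(
        client, build_bulk_actions(documents, index_name)
    )
    return indexed


# Building the actions needs no server; only index_documents() talks to OpenSearch.
actions = build_bulk_actions(
    [{"uuid": "benchmark-001", "job_name": "Model Test"}], "llm-benchmarks"
)
print(actions[0]["_index"])
```

Calling `index_documents(docs, "http://opensearch:9200", "llm-benchmarks")` would require a reachable OpenSearch instance and any authentication settings that deployment needs.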
The parser handles the following error conditions:
- Missing or invalid JSON files
- Missing benchmark data
- OpenSearch connection failures
- Invalid data structures
Errors are printed to stderr with descriptive messages, and the script exits with a non-zero status code on failure.
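The behavior described above (a message on stderr followed by a non-zero exit) can be sketched as follows for the file-loading case. This is an illustration rather than the parser's actual code: the function name `load_results`, the message wording, and the specific exit status of 1 are all assumptions.

```python
import json
import sys


def load_results(path: str):
    """Load a benchmark JSON file, exiting with status 1 on failure."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        print(f"Error: results file not found: {path}", file=sys.stderr)
    except json.JSONDecodeError as exc:
        print(f"Error: invalid JSON in {path}: {exc}", file=sys.stderr)
    # Reached only when one of the handlers above ran.
    sys.exit(1)
```

A caller such as a shell script or CI job can then rely on the exit status alone to detect a failed parse.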
- Python 3.6+
- opensearch-py (for OpenSearch integration)
This project is released under the Apache License 2.0.
For issues or questions, please open an issue on GitHub.