From ddaf4a6680daffc0ba1c14251d633c112afbf2cb Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Thu, 25 Jun 2026 13:16:28 -0400 Subject: [PATCH 1/8] migration guide Signed-off-by: David Butenhof --- docs/guides/backends.md | 2 +- docs/guides/datasets.md | 2 +- docs/guides/embeddings.md | 2 +- docs/guides/v0.7.0_migration_guide.md | 108 ++++++++++++++++++++++++++ 4 files changed, 111 insertions(+), 3 deletions(-) create mode 100644 docs/guides/v0.7.0_migration_guide.md diff --git a/docs/guides/backends.md b/docs/guides/backends.md index 69bfe15e5..e998dce92 100644 --- a/docs/guides/backends.md +++ b/docs/guides/backends.md @@ -147,4 +147,4 @@ guidellm run \ ## Expanding Backend Support -GuideLLM is an open platform, and we encourage contributions to extend its backend support. Whether it's adding new server implementations, integrating with Python-based backends, or enhancing existing capabilities, your contributions are welcome. For more details on how to contribute, see the [CONTRIBUTING.md](../../CONTRIBUTING.md) file. +GuideLLM is an open platform, and we encourage contributions to extend its backend support. Whether it's adding new server implementations, integrating with Python-based backends, or enhancing existing capabilities, your contributions are welcome. For more details on how to contribute, see the [CONTRIBUTING.md](https://github.com/vllm-project/guidellm/blob/main/CONTRIBUTING.md) file. diff --git a/docs/guides/datasets.md b/docs/guides/datasets.md index c7f0c8157..9bb525783 100644 --- a/docs/guides/datasets.md +++ b/docs/guides/datasets.md @@ -333,7 +333,7 @@ guidellm preprocess dataset \ | Argument | Description | | ------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | -| `DATA` | Identify the dataset to process. Supports all dataset formats documented in the [Dataset Configurations](../datasets.md). 
| +| `DATA` | Identify the dataset to process. Supports all dataset formats documented in the [Dataset Configurations](#datasets). | | `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). | | `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path. | | `--config` | **Required.** Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). | diff --git a/docs/guides/embeddings.md b/docs/guides/embeddings.md index d37cc80bb..2558c435d 100644 --- a/docs/guides/embeddings.md +++ b/docs/guides/embeddings.md @@ -37,6 +37,6 @@ guidellm run \ ## See Also -- [Benchmark Profiles](benchmark-profiles.md) - Detailed explanation of all profile types +- [Benchmark Profiles](../getting-started/benchmark.md#benchmark-profiles---profile) - Detailed explanation of all profile types - [Datasets Guide](datasets.md) - Creating and using custom datasets - [Metrics Guide](metrics.md) - Understanding performance metrics diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md new file mode 100644 index 000000000..343a8468b --- /dev/null +++ b/docs/guides/v0.7.0_migration_guide.md @@ -0,0 +1,108 @@ +# CLI Migration Guide + +## `guidellm benchmark [run]` + +Run a benchmark against a generative model. + +This command is now `guidellm run` + +| v0.6.0 option | v0.7.0 equivalent | +| :---- | :---- | +| \--backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api\_key": "apikey-\*", "verify": false}' | Options passed to `--backend`, like `--backend "kind=openai_http,api_key=sk…"` | +| \--backend Backend type. Options: vllm\_python, openai\_http.
| Merged with `--backend-kwargs`, for example `--backend '{"kind": "openai_http", "extras": {"body": {"temperature": 0.6}}}'` | +| \--cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric \>=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,cooldown=2` for a two-second cooldown or `--profile '{"kind":"concurrent","cooldown":{"mode":"duration","value":2}}'` | +| \--data-args JSON string of arguments to pass to dataset creation. | Specified with “load\_kwargs” as part of data, e.g., `--data '{"kind":"huggingface","load_kwargs":{"split":"train"}}'` | +| \--data-column-mapper JSON string of column mappings to apply to the dataset. E.g., '{"text\_column": "article", "output\_tokens\_count\_column" :"output\_tokens"}' | Data column mappers have a “kind”: `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` | +| \--data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}' | Use `--data-finalizer kind=generative` | +| \--data-num-workers Number of worker processes for data loading. | Specified as part of Data Loader configuration with `--data-loader kind=pytorch,num_workers=3` | +| \--data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | `--data-preprocessor '{"kind":"encode_media","audio_kwargs":{"format":"mp3"}}'` | +| \--data-preprocessors List of preprocessors to apply to the dataset. E.g., 'encode\_media,my\_custom\_preprocessor' | `--data-preprocessor kind=encode_media`; repeat the option to configure multiple preprocessors. | +| \--data-sampler Data sampler type. | Shuffle function is under `--data-loader kind=pytorch,shuffle=true` | +| \--data-samples Number of samples from dataset.
\-1 (default) uses all samples and dynamically generates more. | Specify as part of Data Loader configuration, as `--data-loader kind=pytorch,samples=10` | +| \--data HuggingFace dataset ID, path to dataset, path to data file (csv/json/jsonl/txt), or synthetic data config (json/key=value). | `--data kind=huggingface,source=`, `--data kind=csv_file,path=`, or `--data kind=synthetic_text,prompt_tokens=128,output_tokens=64` | +| \--dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, as `--data-loader kind=pytorch,shuffle=true,samples=100` | +| \--detect-saturation Enable over-saturation detection with default settings. | Enable oversaturation constraint, for example `--constraint kind=over_saturation` | +| \--disable-console-interactive Disable interactive console progress updates. | Unchanged: `--disable-console-interactive` or `--disable-progress` | +| \--disable-console Disable all outputs to the console (updates, interactive progress, results). | Unchanged: `--disable-console` or `--disable-console-outputs` | +| \--max-error-rate Maximum error rate before stopping the benchmark. | Enable maximum error rate constraint, for example `--constraint kind=max_error_rate,rate=10` | +| \--max-errors Maximum errors before stopping the benchmark. | Enable maximum error count constraint, for example `--constraint kind=max_errors,count=10` | +| \--max-global-error-rate Maximum global error rate across all benchmarks. | Enable maximum global error rate constraint, for example `--constraint kind=max_global_error_rate,rate=10,minimum=100` | +| \--max-requests Maximum requests per benchmark. If None, runs until max\_seconds or data exhaustion. | Enable maximum requests constraint, for example `--constraint kind=max_requests,count=1000` | +| \--max-seconds Maximum seconds per benchmark. If None, runs until max\_requests or data exhaustion.
| Enable maximum duration constraint, for example `--constraint kind=max_duration,seconds=60` | +| \--model Model ID to benchmark. If not provided, uses first available model. | Specify a model name as part of the backend configuration, for example `--backend kind=openai_http,model=gpt4` | +| \--output-dir or \--output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | +| \--outputs The filename.ext for each of the outputs to create or the aliases (json, csv, html) for the output files to create with their default file names (benchmark.\[EXT\]) | Specify multiple output formats by repeating the `--output` option, for example `--output kind=json,path=benchmark.json --output kind=csv,path=benchmark.csv` | +| \--over-saturation Enable over-saturation detection. Pass a JSON dict with configuration (e.g., '{"enabled": true, "min\_seconds": 30}'). Defaults to None (disabled). | Enable oversaturation constraint, for example `--constraint kind=over_saturation,mode=enforce,min_seconds=30` | +| \--processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly to the tokenizer, for example `--tokenizer '{"kind":"huggingface_auto","load_kwargs":{"fast":true}}'` | +| \--processor Processor or tokenizer for token count calculations. If not provided, loads from model. | Defaults to the default tokenizer for the first model supported by the backend target. To override, use `--tokenizer kind=huggingface_auto,model=gpt4` | +| \--profile Benchmark profile type. Options: sweep, async, poisson, synchronous, throughput, concurrent, constant. | Specify the benchmark profile to use, for example, `--profile kind=sweep,sweep_size=10,warmup=1,cooldown=1` | +| \--rampup The time, in seconds, to ramp up the request rate over.
Applicable for Throughput, Concurrent, and Constant strategies. | Specify as part of profile, for example `--profile kind=constant,rate=10,rampup_duration=2` | +| \--random-seed Random seed for reproducibility. | Specify the random seed configuration like `--seed kind=static,value=42` | +| \--rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | “Rate” was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | +| \--request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. Default: default-template. For openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat\_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | +| \--sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20\. | TBD | +| \--scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | +| \--target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). | Specify as part of backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | +| \--warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts.
Numeric in (0, 1): percent of duration or request count. Numeric \>=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two-second warmup or `--profile '{"kind":"concurrent","warmup":{"mode":"duration","value":2}}'` | + +## `guidellm benchmark from-file` + +Load a saved benchmark report and optionally re-export data. + +| Option | v0.7.0 equivalent | +| :---- | :---- | +| PATH | Unchanged | +| Path to the saved benchmark report file (default: ./benchmarks.json). | Unchanged | +| \--output-path | Unchanged | +| Directory or file path to save re-exported benchmark results. If a directory, all output formats will be saved there. If a file, the matching format will be saved to that file. | Unchanged | +| \--output-formats | Unchanged | +| Output formats for benchmark results (e.g., console, json, html, csv). | Unchanged | + +## `guidellm config` + +Show configuration settings. + +Changed from `guidellm config` to `guidellm env` to clarify that it displays environment variables affecting GuideLLM operation. + +`guidellm config` will be used later for a different purpose, to generate YAML config files from `run` options. + +## `guidellm mock-server` + +Start a mock OpenAI/vLLM-compatible server for testing. **\[NO CHANGE\]** + +| v0.6.0 option | v0.7.0 equivalent | +| :---- | :---- | +| \--host TEXT Host address to bind the server to. | Unchanged | +| \--port INTEGER Port number to bind the server to. | Unchanged | +| \--workers INTEGER Number of worker processes. | Unchanged | +| \--model TEXT Name of the model to mock. | Unchanged | +| \--processor TEXT Processor or tokenizer to use for requests. | Unchanged | +| \--request-latency FLOAT Request latency in seconds for non-streaming requests. | Unchanged | +| \--request-latency-std FLOAT Request latency standard deviation in seconds (normal distribution).
| Unchanged | +| \--ttft-ms FLOAT Time to first token in milliseconds for streaming requests. | Unchanged | +| \--ttft-ms-std FLOAT Time to first token standard deviation in milliseconds. | Unchanged | +| \--itl-ms FLOAT Inter-token latency in milliseconds for streaming requests. | Unchanged | +| \--itl-ms-std FLOAT Inter-token latency standard deviation in milliseconds. | Unchanged | +| \--output-tokens INTEGER Number of output tokens for streaming requests. | Unchanged | +| \--output-tokens-std FLOAT Output tokens standard deviation (normal distribution). | Unchanged | + +## `guidellm preprocess dataset` + +Tools for preprocessing datasets for use in benchmarks. + +| v0.6.0 option | v0.7.0 equivalent | +| :---- | :---- | +| data (positional parameter) | Use dataset descriptor, for example `kind=huggingface,source=` | +| output_path (positional parameter) | Results file path, for example `file.json` | +| \--processor TEXT Processor or tokenizer name for calculating token counts. | Unchanged | +| \--config TEXT PreprocessDatasetConfig as JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). Example: `prompt_tokens=100,output_tokens=50,prefix_tokens_max=10` or `{"prompt_tokens": 100, "output_tokens": 50, "prefix_tokens_max": 10}` \[Mandatory\] | Unchanged | +| \--processor-args TEXT JSON string of arguments to pass to the processor constructor. | Unchanged | +| \--data-args TEXT JSON string of arguments to pass to dataset creation | Unchanged | +| \--data-column-mapper JSON string of column mappings to apply to the dataset | Specify a data column mapper object, for example `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` | +| \--short-prompt-strategy \[ignore\|concatenate\|pad\|error\] Strategy for handling prompts shorter than target length. \[default: ignore\] | Unchanged | +| \--pad-char TEXT Character to pad short prompts with when using “pad” strategy (used with ‘concatenate’ strategy).
| Unchanged | +| \--concat-delimiter TEXT Delimiter for concatenating short prompts (used with ‘concatenate’ strategy). | Unchanged | +| \--include-prefix-in-token-count Include prefix tokens in prompt token count calculation. | Unchanged | +| \--push-to-hub Push the processed dataset to Hugging Face Hub. | Unchanged | +| \--hub-dataset-id TEXT Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). | Unchanged | +| \--random-seed INTEGER Random seed for reproducible token sampling. \[default: 42\] | Unchanged | From a7b29f600a069255b656c859002e9ea52cce36dd Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Thu, 25 Jun 2026 15:41:43 -0400 Subject: [PATCH 2/8] Update Signed-off-by: David Butenhof --- docs/guides/v0.7.0_migration_guide.md | 163 +++++++++++++------------- 1 file changed, 84 insertions(+), 79 deletions(-) diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md index 343a8468b..50d6abae8 100644 --- a/docs/guides/v0.7.0_migration_guide.md +++ b/docs/guides/v0.7.0_migration_guide.md @@ -6,57 +6,62 @@ Run a benchmark against a generative model. This command is now `guidellm run` -| v0.6.0 option | v0.7.0 equivalent | -| :---- | :---- | -| \--backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api\_key": "apikey-\*", "verify": false}' | Options passed to `--backend`, like `--backend "kind=openai_http,api_key=sk…"` | -| \--backend Backend type. Options: vllm\_python, openai\_http. | Merged with `--backend-kwargs`, for example `--backend '{"kind": "openai_http", "extras": {"body": {"temperature": 0.6}}}'` | -| \--cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric \>=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema.
| Specify with profile, e.g., `--profile kind=synchronous,cooldown=2` for a two-second cooldown or `--profile '{"kind":"concurrent","cooldown":{"mode":"duration","value":2}}'` | -| \--data-args JSON string of arguments to pass to dataset creation. | Specified with “load\_kwargs” as part of data, e.g., `--data '{"kind":"huggingface","load_kwargs":{"split":"train"}}'` | -| \--data-column-mapper JSON string of column mappings to apply to the dataset. E.g., '{"text\_column": "article", "output\_tokens\_count\_column" :"output\_tokens"}' | Data column mappers have a “kind”: `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` | -| \--data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}' | Use `--data-finalizer kind=generative` | -| \--data-num-workers Number of worker processes for data loading. | Specified as part of Data Loader configuration with `--data-loader kind=pytorch,num_workers=3` | -| \--data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | `--data-preprocessor '{"kind":"encode_media","audio_kwargs":{"format":"mp3"}}'` | -| \--data-preprocessors List of preprocessors to apply to the dataset. E.g., 'encode\_media,my\_custom\_preprocessor' | `--data-preprocessor kind=encode_media`; repeat the option to configure multiple preprocessors. | -| \--data-sampler Data sampler type. | Shuffle function is under `--data-loader kind=pytorch,shuffle=true` | -| \--data-samples Number of samples from dataset. \-1 (default) uses all samples and dynamically generates more. | Specify as part of Data Loader configuration, as `--data-loader kind=pytorch,samples=10` | -| \--data HuggingFace dataset ID, path to dataset, path to data file (csv/json/jsonl/txt), or synthetic data config (json/key=value).
| `--data kind=huggingface,source=`, `--data kind=csv_file,path=`, or `--data kind=synthetic_text,prompt_tokens=128,output_tokens=64` | -| \--dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, as `--data-loader kind=pytorch,shuffle=true,samples=100` | -| \--detect-saturation Enable over-saturation detection with default settings. | Enable oversaturation constraint, for example `--constraint kind=over_saturation` | -| \--disable-console-interactive Disable interactive console progress updates. | Unchanged: `--disable-console-interactive` or `--disable-progress` | -| \--disable-console Disable all outputs to the console (updates, interactive progress, results). | Unchanged: `--disable-console` or `--disable-console-outputs` | -| \--max-error-rate Maximum error rate before stopping the benchmark. | Enable maximum error rate constraint, for example `--constraint kind=max_error_rate,rate=10` | -| \--max-errors Maximum errors before stopping the benchmark. | Enable maximum error count constraint, for example `--constraint kind=max_errors,count=10` | -| \--max-global-error-rate Maximum global error rate across all benchmarks. | Enable maximum global error rate constraint, for example `--constraint kind=max_global_error_rate,rate=10,minimum=100` | -| \--max-requests Maximum requests per benchmark. If None, runs until max\_seconds or data exhaustion. | Enable maximum requests constraint, for example `--constraint kind=max_requests,count=1000` | -| \--max-seconds Maximum seconds per benchmark. If None, runs until max\_requests or data exhaustion. | Enable maximum duration constraint, for example `--constraint kind=max_duration,seconds=60` | -| \--model Model ID to benchmark. If not provided, uses first available model.
| Specify a model name as part of the backend configuration, for example `--backend kind=openai_http,model=gpt4` | -| \--output-dir or \--output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | -| \--outputs The filename.ext for each of the outputs to create or the aliases (json, csv, html) for the output files to create with their default file names (benchmark.\[EXT\]) | Specify multiple output formats by repeating the `--output` option, for example `--output kind=json,path=benchmark.json --output kind=csv,path=benchmark.csv` | -| \--over-saturation Enable over-saturation detection. Pass a JSON dict with configuration (e.g., '{"enabled": true, "min\_seconds": 30}'). Defaults to None (disabled). | Enable oversaturation constraint, for example `--constraint kind=over_saturation,mode=enforce,min_seconds=30` | -| \--processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly to the tokenizer, for example `--tokenizer '{"kind":"huggingface_auto","load_kwargs":{"fast":true}}'` | -| \--processor Processor or tokenizer for token count calculations. If not provided, loads from model. | Defaults to the default tokenizer for the first model supported by the backend target. To override, use `--tokenizer kind=huggingface_auto,model=gpt4` | -| \--profile Benchmark profile type. Options: sweep, async, poisson, synchronous, throughput, concurrent, constant. | Specify the benchmark profile to use, for example, `--profile kind=sweep,sweep_size=10,warmup=1,cooldown=1` | -| \--rampup The time, in seconds, to ramp up the request rate over. Applicable for Throughput, Concurrent, and Constant strategies. | Specify as part of profile, for example `--profile kind=constant,rate=10,rampup_duration=2` | -| \--random-seed Random seed for reproducibility.
| Specify the random seed configuration like `--seed kind=static,value=42` | -| \--rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | “Rate” was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | -| \--request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. Default: default-template. For openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat\_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | -| \--sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20\. | TBD | -| \--scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | -| \--target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). | Specify as part of backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | -| \--warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. Numeric in (0, 1): percent of duration or request count. Numeric \>=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema.
| Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile ‘{“kind”:”concurrent”,”warmup”:{“mode”:”duration”,”value”:2}}` | +| v0.6.0 option | v0.7.0 equivalent | +| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| --backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api_key": "apikey-\*", "verify": false}' | Options passed to `--backend`, like `--backend “kind=openai_http,api_key=sk…”` | +| --backend Backend type. Options: vllm_python, openai_http. | Merged with `--backend-kwargs` `--backend ‘{“kind”: “openai_http”, “extras”: {“body”: {“temperature”: 0.6}}}’` | +| --cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,cooldown=2` for a two second cooldown or `--profile ‘{“kind”:”concurrent”,”cooldown”:{“mode”:”duration”,”value”:2}}` | +| --data-args JSON string of arguments to pass to dataset creation. 
| Specified with “load_kwargs” as part of data, e.g., `--data '{"kind":"huggingface","load_kwargs":{"split":"train"}}'` | +| --data-column-mapper JSON string of column mappings to apply to the dataset. E.g., '{"text_column": "article", "output_tokens_count_column" :"output_tokens"}' | Data column mappers have a “kind”: `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` | +| --data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}' | Use `--data-finalizer kind=generative` | +| --data-num-workers Number of worker processes for data loading. | Specified as part of Data Loader configuration with `--data-loader kind=pytorch,num_workers=3` | +| --data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | `--data-preprocessor '{"kind":"encode_media","audio_kwargs":{"format":"mp3"}}'` | +| --data-preprocessors List of preprocessors to apply to the dataset. E.g., 'encode_media,my_custom_preprocessor' | `--data-preprocessor kind=encode_media`; repeat the option to configure multiple preprocessors. | +| --data-sampler Data sampler type. | Shuffle function is under `--data-loader kind=pytorch,shuffle=true` | +| --data-samples Number of samples from dataset. -1 (default) uses all samples and dynamically generates more. | Specify as part of Data Loader configuration, as `--data-loader kind=pytorch,samples=10` | +| --data HuggingFace dataset ID, path to dataset, path to data file (csv/json/jsonl/txt), or synthetic data config (json/key=value). | `--data kind=huggingface,source=`, `--data kind=csv_file,path=`, or `--data kind=synthetic_text,prompt_tokens=128,output_tokens=64` | +| --dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, as `--data-loader kind=pytorch,shuffle=true,samples=100` | +| --detect-saturation Enable over-saturation detection with default settings.
| Enable oversaturation constraint, for example `--constraint kind=over_saturation` | +| --disable-console-interactive Disable interactive console progress updates. | Unchanged: `--disable-console-interactive` or `--disable-progress` | +| --disable-console Disable all outputs to the console (updates, interactive progress, results). | Unchanged: `--disable-console` or `--disable-console-outputs` | +| --max-error-rate Maximum error rate before stopping the benchmark. | Enable maximum error rate constraint, for example `--constraint kind=max_error_rate,rate=10` | +| --max-errors Maximum errors before stopping the benchmark. | Enable maximum error count constraint, for example `--constraint kind=max_errors,count=10` | +| --max-global-error-rate Maximum global error rate across all benchmarks. | Enable maximum global error rate constraint, for example `--constraint kind=max_global_error_rate,rate=10,minimum=100` | +| --max-requests Maximum requests per benchmark. If None, runs until max_seconds or data exhaustion. | Enable maximum requests constraint, for example `--constraint kind=max_requests,count=1000` | +| --max-seconds Maximum seconds per benchmark. If None, runs until max_requests or data exhaustion. | Enable maximum duration constraint, for example `--constraint kind=max_duration,seconds=60` | +| --model Model ID to benchmark. If not provided, uses first available model. 
| Specify a model name as part of the backend configuration, for example `--backend kind=openai_http,model=gpt4` | +| --output-dir or --output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | +| --outputs The filename.ext for each of the outputs to create or the aliases (json, csv, html) for the output files to create with their default file names (benchmark.[EXT]) | Specify multiple output formats by repeating the `--output` option, for example `--output kind=json,path=benchmark.json --output kind=csv,path=benchmark.csv` | +| --over-saturation Enable over-saturation detection. Pass a JSON dict with configuration (e.g., '{"enabled": true, "min_seconds": 30}'). Defaults to None (disabled). | Enable oversaturation constraint, for example `--constraint kind=over_saturation,mode=enforce,min_seconds=30` | +| --processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly to the tokenizer, for example `--tokenizer '{"kind":"huggingface_auto","load_kwargs":{"fast":true}}'` | +| --processor Processor or tokenizer for token count calculations. If not provided, loads from model. | Defaults to the default tokenizer for the first model supported by the backend target. To override, use `--tokenizer kind=huggingface_auto,model=gpt4` | +| --profile Benchmark profile type. Options: sweep, async, poisson, synchronous, throughput, concurrent, constant. | Specify the benchmark profile to use, for example, `--profile kind=sweep,sweep_size=10,warmup=1,cooldown=1` | +| --rampup The time, in seconds, to ramp up the request rate over. Applicable for Throughput, Concurrent, and Constant strategies. | Specify as part of profile, for example `--profile kind=constant,rate=10,rampup_duration=2` | +| --random-seed Random seed for reproducibility.
| Specify the random seed configuration like `--seed kind=static,value=42` | +| --rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | “Rate” was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | +| --request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. Default: default-templateFor openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | +| --sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20. | Specify as part of the profile configuration, for example `--profile kind=concurrent,sample_size=20` | +| --scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | +| --target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). | Specify as part of backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | +| --warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. 
Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile ‘{“kind”:”concurrent”,”warmup”:{“mode”:”duration”,”value”:2}}` | + +| **NEW OPTIONS** | **v0.7.0 new options** | +| :----------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Add metadata to output reports | Specify key-value pairs of metadata labels which will be written to the output reports, for example `--label gpu=NVIDIA_Z500 --label creator=Intrepid_Adventurer` | +| Override concurrent profile stream count and async/constant/poisson rate | Specify profile settings to override the default profile settings, for example `--override profile.rate 10,20,30` or `--override profile.streams 10,20,30` | ## `Guidellm benchmark from-file` Load a saved benchmark report and optionally re-export data -| Option | v0.7.0 equivalent | -| :---- | :---- | -| PATH | Unchanged | -| Path to the saved benchmark report file (default: ./benchmarks.json). | Unchanged | -| \--output-path | Unchanged | -| Directory or file path to save re-exported benchmark results. If a directory, all output formats will be saved there. If a file, the matching format will be saved to that file. | Unchanged | -| \--output-formats | Unchanged | -| Output formats for benchmark results (e.g., console, json, html, csv). | Unchanged | +| Option | v0.7.0 equivalent | +| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------- | +| PATH | Unchanged | +| Path to the saved benchmark report file (default: ./benchmarks.json). | Unchanged | +| --output-path | Unchanged | +| Directory or file path to save re-exported benchmark results. 
If a directory, all output formats will be saved there. If a file, the matching format will be saved to that file. | Unchanged | +| --output-formats | Unchanged | +| Output formats for benchmark results (e.g., console, json, html, csv). | Unchanged | ## `guidellm config` @@ -68,41 +73,41 @@ Changed from `guidellm config` to `guidellm env` to clarify that it displays env ## `guidellm mock-server` -Start a mock OpenAI/vLLM-compatible server for testing. **\[NO CHANGE\]** - -| v0.6.0 option | v0.7.0 equivalent | -| :---- | :---- | -| \--host TEXT Host address to bind the server to. | Unchanged | -| \--port INTEGER Port number to bind the server to. | Unchanged | -| \--workers INTEGER Number of worker processes. | Unchanged | -| \--model TEXT Name of the model to mock. | Unchanged | -| \--processor TEXT Processor or tokenizer to use for requests. | Unchanged | -| \--request-latency FLOAT Request latency in seconds for non-streaming requests. | Unchanged | -| \--request-latency-std FLOAT Request latency standard deviation in seconds (normal distribution). | Unchanged | -| \--ttft-ms FLOAT Time to first token in milliseconds for streaming requests. | Unchanged | -| \--ttft-ms-std FLOAT Time to first token standard deviation in milliseconds. | Unchanged | -| \--itl-ms FLOAT Inter-token latency in milliseconds for streaming requests. | Unchanged | -| \--itl-ms-std FLOAT Inter-token latency standard deviation in milliseconds. | Unchanged | -| \--output-tokens INTEGER Number of output tokens for streaming requests. | Unchanged | -| \--output-tokens-std FLOAT Output tokens standard deviation (normal distribution). | Unchanged | +Start a mock OpenAI/vLLM-compatible server for testing. **[NO CHANGE]** + +| v0.6.0 option | v0.7.0 equivalent | +| :----------------------------------------------------------------------------------------------- | :---------------- | +| --host TEXT Host address to bind the server to. | Unchanged | +| --port INTEGER Port number to bind the server to. 
| Unchanged | +| --workers INTEGER Number of worker processes. | Unchanged | +| --model TEXT Name of the model to mock. | Unchanged | +| --processor TEXT Processor or tokenizer to use for requests. | Unchanged | +| --request-latency FLOAT Request latency in seconds for non-streaming requests. | Unchanged | +| --request-latency-std FLOAT Request latency standard deviation in seconds (normal distribution). | Unchanged | +| --ttft-ms FLOAT Time to first token in milliseconds for streaming requests. | Unchanged | +| --ttft-ms-std FLOAT Time to first token standard deviation in milliseconds. | Unchanged | +| --itl-ms FLOAT Inter-token latency in milliseconds for streaming requests. | Unchanged | +| --itl-ms-std FLOAT Inter-token latency standard deviation in milliseconds. | Unchanged | +| --output-tokens INTEGER Number of output tokens for streaming requests. | Unchanged | +| --output-tokens-std FLOAT Output tokens standard deviation (normal distribution). | Unchanged | ## `guidellm preprocess dataset` Tools for preprocessing datasets for use in benchmarks. -| v0.6.0 option | v0.7.0 equivalent | -| :---- | :---- | -| data (positional parameter) | Use dataset descriptor, for example `kind=huggingface,source=` | -| output_path (positional parameter) | Results file path, for example `file.json` | -| \--processor TEXT Processor or tokenizer name for calculating token counts. | Unchanged | -| \--config TEXT PreprocessDatasetConfig as JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). Example: `prompt_tokens=100,output_tokens=50,prefix_tokens_max=10` or `{"prompt_tokens": 100, "output_tokens": 50, "prefix_tokens_max": 10}` \[Mandatory\] | Unchanged | -| \--processor-args TEXT JSON string of arguments to pass to the processor constructor. 
| Unchanged | -| \--data-args TEXT JSON string of arguments to pass to dataset creation | Unchanged | -| \--data-column-mapper JSON string of column mappings to apply to the dataset | Specify a data column mapper object, for example `--data-column-mapper ‘{“kind”:”generative_column_mapper”,”column_mappings”:{“text_column”:”instruction”}}` | -| \--short-prompt-strategy \[ignore|concatenate|pad|error\] Strategy for handling prompts shorter than target length. \[default: ignore\] | Unchanged | -| \--pad-char TEXT Character to pad short prompts with when using “pad” strategy (used with ‘concatenate’ strategy). | Unchanged | -| \--concat-delimiter TEXT Delimiter for concatenating short prompts (used with ‘concatenate’ strategy). | Unchanged | -| \--include-prefix-in-token-count Include prefix tokens in prompt token count calculation. | Unchanged | -| \--push-to-hub Push the processed dataset to Hugging Face Hub. | Unchanged | -| \--hub-dataset-id TEXT Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). | Unchanged | -| \--random-seed INTEGER Random seed for reproducible token sampling. \[default: 42\] | Unchanged | +| v0.6.0 option | v0.7.0 equivalent | +| :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| data (positional parameter) | Use dataset descriptor, for example `kind=huggingface,source=` | +| output_path (positional parameter) | Results file path, for example `file.json` | +| --processor TEXT Processor or tokenizer name for calculating token counts. 
| Unchanged | +| --config TEXT PreprocessDatasetConfig as JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). Example: `prompt_tokens=100,output_tokens=50,prefix_tokens_max=10` or `{"prompt_tokens": 100, "output_tokens": 50, "prefix_tokens_max": 10}` [Mandatory] | Unchanged | +| --processor-args TEXT JSON string of arguments to pass to the processor constructor. | Unchanged | +| --data-args TEXT JSON string of arguments to pass to dataset creation | Unchanged | +| --data-column-mapper JSON string of column mappings to apply to the dataset | Specify a data column mapper object, for example `--data-column-mapper ‘{“kind”:”generative_column_mapper”,”column_mappings”:{“text_column”:”instruction”}}` | +| --short-prompt-strategy \[ignore | concatenate | +| --pad-char TEXT Character to pad short prompts with when using “pad” strategy (used with ‘concatenate’ strategy). | Unchanged | +| --concat-delimiter TEXT Delimiter for concatenating short prompts (used with ‘concatenate’ strategy). | Unchanged | +| --include-prefix-in-token-count Include prefix tokens in prompt token count calculation. | Unchanged | +| --push-to-hub Push the processed dataset to Hugging Face Hub. | Unchanged | +| --hub-dataset-id TEXT Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). | Unchanged | +| --random-seed INTEGER Random seed for reproducible token sampling. [default: 42] | Unchanged | From 20136f20c48a6b8a652c793ad8f277436b4b1020 Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Thu, 25 Jun 2026 16:21:01 -0400 Subject: [PATCH 3/8] Something got mangled ... 
Signed-off-by: David Butenhof --- docs/guides/v0.7.0_migration_guide.md | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md index 50d6abae8..02acf59c2 100644 --- a/docs/guides/v0.7.0_migration_guide.md +++ b/docs/guides/v0.7.0_migration_guide.md @@ -54,14 +54,11 @@ This command is now `guidellm run` Load a saved benchmark report and optionally re-export data -| Option | v0.7.0 equivalent | -| :------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------- | -| PATH | Unchanged | -| Path to the saved benchmark report file (default: ./benchmarks.json). | Unchanged | -| --output-path | Unchanged | -| Directory or file path to save re-exported benchmark results. If a directory, all output formats will be saved there. If a file, the matching format will be saved to that file. | Unchanged | -| --output-formats | Unchanged | -| Output formats for benchmark results (e.g., console, json, html, csv). | Unchanged | +| Option | v0.7.0 equivalent | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------- | +| PATH Path to the saved benchmark report file (default: ./benchmarks. | Unchanged | +| --output-path Directory or file path to save re-exported benchmark results. If a directory, all output formats will be saved there. If a file, the matching format will be saved to that file. | Unchanged | +| --output-formats Output formats for benchmark results (e.g., console, json, html, csv). 
| Unchanged | ## `guidellm config` From c39d7fbe300042bcd405aed6fd0cd1075462c257 Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Thu, 25 Jun 2026 16:29:05 -0400 Subject: [PATCH 4/8] Another cleanup on aisle preprocess Signed-off-by: David Butenhof --- docs/guides/v0.7.0_migration_guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md index 02acf59c2..56bab2707 100644 --- a/docs/guides/v0.7.0_migration_guide.md +++ b/docs/guides/v0.7.0_migration_guide.md @@ -101,7 +101,7 @@ Tools for preprocessing datasets for use in benchmarks. | --processor-args TEXT JSON string of arguments to pass to the processor constructor. | Unchanged | | --data-args TEXT JSON string of arguments to pass to dataset creation | Unchanged | | --data-column-mapper JSON string of column mappings to apply to the dataset | Specify a data column mapper object, for example `--data-column-mapper ‘{“kind”:”generative_column_mapper”,”column_mappings”:{“text_column”:”instruction”}}` | -| --short-prompt-strategy \[ignore | concatenate | +| --short-prompt-strategy [ignore, concatenate, pad, error] Strategy for handling prompts shorter than target length. [default: ignore] | Unchanged | | --pad-char TEXT Character to pad short prompts with when using “pad” strategy (used with ‘concatenate’ strategy). | Unchanged | | --concat-delimiter TEXT Delimiter for concatenating short prompts (used with ‘concatenate’ strategy). | Unchanged | | --include-prefix-in-token-count Include prefix tokens in prompt token count calculation. | Unchanged | From 431745d4844087ef7ba05e57f6e065dc56a87d52 Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Thu, 25 Jun 2026 16:36:45 -0400 Subject: [PATCH 5/8] formatting? 
Signed-off-by: David Butenhof --- docs/guides/datasets.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/datasets.md b/docs/guides/datasets.md index 9bb525783..23be265c7 100644 --- a/docs/guides/datasets.md +++ b/docs/guides/datasets.md @@ -333,7 +333,7 @@ guidellm preprocess dataset \ | Argument | Description | | ------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | -| `DATA` | Identify the dataset to process. Supports all dataset formats documented in the [Dataset Configurations](#datasets). | +| `DATA` | Identify the dataset to process. Supports all dataset formats documented in the [Dataset Configurations](#datasets). | | `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). | | `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path. | | `--config` | **Required.** Configuration specifying target token sizes. Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). | From cf491f2948b8ebcac9d27209c4e88a3a460020f3 Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Thu, 25 Jun 2026 20:15:22 -0400 Subject: [PATCH 6/8] Update sample size Signed-off-by: David Butenhof --- docs/guides/v0.7.0_migration_guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md index 56bab2707..c47e018f6 100644 --- a/docs/guides/v0.7.0_migration_guide.md +++ b/docs/guides/v0.7.0_migration_guide.md @@ -40,7 +40,7 @@ This command is now `guidellm run` | --random-seed Random seed for reproducibility. | Specify the random seed configuration like `--seed kind=static,value=42` | | --rate Benchmark rate(s) to test. 
Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | “Rate” was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | | --request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. Default: default-templateFor openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | -| --sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20. | Specify as part of the profile configuration, for example `--profile kind=concurrent,sample_size=20` | +| --sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20. | Specify as part of the metrics configuration, for example `--metrics kind=generative,sample_size=20` | | --scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | | --target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). | Specify as part of backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | | --warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. 
Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile ‘{“kind”:”concurrent”,”warmup”:{“mode”:”duration”,”value”:2}}` | From c55bd32924a08cb956bba1585330edee8e2d3b58 Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Fri, 26 Jun 2026 07:02:29 -0400 Subject: [PATCH 7/8] stuff Signed-off-by: David Butenhof --- docs/guides/v0.7.0_migration_guide.md | 56 +++++++++++++-------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md index c47e018f6..6822c2634 100644 --- a/docs/guides/v0.7.0_migration_guide.md +++ b/docs/guides/v0.7.0_migration_guide.md @@ -8,19 +8,19 @@ This command is now `guidellm run` | v0.6.0 option | v0.7.0 equivalent | | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| --backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api_key": "apikey-\*", "verify": false}' | Options passed to `--backend`, like `--backend “kind=openai_http,api_key=sk…”` | -| --backend Backend type. Options: vllm_python, openai_http. 
| Merged with `--backend-kwargs` `--backend ‘{“kind”: “openai_http”, “extras”: {“body”: {“temperature”: 0.6}}}’` | -| --cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,cooldown=2` for a two second cooldown or `--profile ‘{“kind”:”concurrent”,”cooldown”:{“mode”:”duration”,”value”:2}}` | -| --data-args JSON string of arguments to pass to dataset creation. | Specified with “load_kwargs” as part of data, e.g., `--data ‘{“kind”:”huggingface”,”load_kwargs”:{“split”:”train”}}` | -| --data-column-mapper JSON string of column mappings to apply to the dataset. E.g., '{"text_column": "article", "output_tokens_count_column" :"output_tokens"}' | Data column mappers have a “kind”: `--data-column-mapper ‘{“kind”:”generative_column_mapper”,”column_mappings”:{“text_column”:”instruction”}}` | -| --data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}' | Use `--data-finalizer kind=generative` | +| --backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api_key": "apikey-\*", "verify": false}' | Options passed to `--backend`, like `--backend "kind=openai_http,api_key=sk…"` | +| --backend Backend type. Options: vllm_python, openai_http. | Merged with `--backend-kwargs` `--backend '{"kind": "openai_http", "extras": {"body": {"temperature": 0.6}}}'` | +| --cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. 
| Specify with profile, e.g., `--profile kind=synchronous,cooldown=2` for a two second cooldown or `--profile '{"kind":"concurrent","cooldown":{"mode":"duration","value":2}}'` | +| --data-args JSON string of arguments to pass to dataset creation. | Specified with "load_kwargs" as part of data, e.g., `--data '{"kind":"huggingface","load_kwargs":{"split":"train"}}'` | +| --data-column-mapper JSON string of column mappings to apply to the dataset. E.g., '{"text_column": "article", "output_tokens_count_column" :"output_tokens"}' | Data column mappers have a "kind": `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` | +| --data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}' | Use `--data-finalizer kind=generative` | | --data-num-workers Number of worker processes for data loading. | Specified as part of Data Loader configuration with `--data-loader kind=pytorch,num_workers=3` | -| --data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | `--data-preprocessor ‘{“kind”:”encode_media”,”audio_kwargs”:{“format”:”mp3”}}` | +| --data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | `--data-preprocessor '{"kind":"encode_media","audio_kwargs":{"format":"mp3"}}'` | | --data-preprocessors List of preprocessors to apply to the dataset. E.g., 'encode_media,my_custom_preprocessor' | `--data-preprocessor kind=encode_media` … can be repeated to configure multiple preprocessors. | | --data-sampler Data sampler type. | Shuffle function is under `--data-loader kind=pytorch,shuffle=true` | | --data-samples Number of samples from dataset. -1 (default) uses all samples and dynamically generates more.
| Specify as part of Data Loader configuration, as `--data-loader kind=pytorch,samples=10` | | --data HuggingFace dataset ID, path to dataset, path to data file (csv/json/jsonl/txt), or synthetic data config (json/key=value). | `--data kind=huggingface,source=` `--data kind=csv_file,path=` `--data kind=synthetic_text,prompt_tokens=128,output_tokens=64` | -| --dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, as `--data_loader kind=pytorch,shuffle=true,samples=100` | +| --dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, as `--data-loader kind=pytorch,shuffle=true,samples=100` | | --detect-saturation Enable over-saturation detection with default settings. | Enable oversaturation constraint, for example `--constraint kind=over_saturation` | | --disable-console-interactive Disable interactive console progress updates. | Unchanged: `--disable-console-interactive` or `--disable-progress` | | --disable-console Disable all outputs to the console (updates, interactive progress, results). | Unchanged: `--disable-console` or `--disable-console-outputs` | @@ -30,20 +30,20 @@ This command is now `guidellm run` | --max-requests Maximum requests per benchmark. If None, runs until max_seconds or data exhaustion. | Enable maximum requests constraint, for example `--constraint kind=max_requests,count=1000` | | --max-seconds Maximum seconds per benchmark. If None, runs until max_requests or data exhaustion. | Enable maximum duration constraint, for example `--constraint kind=max_duration,seconds=60` | | --model Model ID to benchmark. If not provided, uses first available model. 
| Specify a model name as part of the backend configuration, for example `--backend kind=openai_http,model=gpt4` | -| --output-dir or –output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | +| --output-dir or --output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | | --outputs The filename.ext for each of the outputs to create or the alises (json, csv, html) for the output files to create with their default file names (benchmark.[EXT]) | Specify multiple output formats by repeating the `--output` option, for example `--output kind=json,path=benchmark.json –output kind=csv,path=benchmark.csv` | | --over-saturation Enable over-saturation detection. Pass a JSON dict with configuration (e.g., '{"enabled": true, "min_seconds": 30}'). Defaults to None (disabled). | Enable oversaturation constraint, for example `--constraint kind=over_saturation,mode=enforce,min_seconds=30` | -| --processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly to the tokenizer, for example `--tokenizer ‘{“kind”:”huggingface_auto”,”load_kwargs”:{“fast”:true}}’` | +| --processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly to the tokenizer, for example `--tokenizer '{"kind":"huggingface_auto","load_kwargs":{"fast":true}}'` | | --processor Processor or tokenizer for token count calculations. If not provided, loads from model. | Defaults to the default tokenizer for the first model supported by the backend target. To override, `--tokenizer kind=huggingface_auto,model=gpt4` | | --profile Benchmark profile type. Options: sweep, async, poisson, synchronous, throughput, concurrent, constant.
| Specify the benchmark profile to use, for example, `--profile kind=sweep,sweep_size=10,warmup=1,cooldown=1` | | --rampup The time, in seconds, to ramp up the request rate over. Applicable for Throughput, Concurrent, and Constant strategies | Specify as part of profile, for example `--profile kind=constant,rate=10,rampup_duration=2` | | --random-seed Random seed for reproducibility. | Specify the random seed configuration like `--seed kind=static,value=42` | -| --rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | “Rate” was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | +| --rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | "Rate" was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | | --request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. Default: default-templateFor openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | | --sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20. 
| Specify as part of the metrics configuration, for example `--metrics kind=generative,sample_size=20` | | --scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | | --target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). | Specify as part of backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | -| --warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile ‘{“kind”:”concurrent”,”warmup”:{“mode”:”duration”,”value”:2}}` | +| --warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile '{"kind":"concurrent","warmup":{"mode":"duration","value":2}}'` | | **NEW OPTIONS** | **v0.7.0 new options** | | :----------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -92,19 +92,19 @@ Start a mock OpenAI/vLLM-compatible server for testing. **[NO CHANGE]** Tools for preprocessing datasets for use in benchmarks. 
-| v0.6.0 option | v0.7.0 equivalent | -| :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------- | -| data (positional parameter) | Use dataset descriptor, for example `kind=huggingface,source=` | -| output_path (positional parameter) | Results file path, for example `file.json` | -| --processor TEXT Processor or tokenizer name for calculating token counts. | Unchanged | -| --config TEXT PreprocessDatasetConfig as JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). Example: `prompt_tokens=100,output_tokens=50,prefix_tokens_max=10` or `{"prompt_tokens": 100, "output_tokens": 50, "prefix_tokens_max": 10}` [Mandatory] | Unchanged | -| --processor-args TEXT JSON string of arguments to pass to the processor constructor. | Unchanged | -| --data-args TEXT JSON string of arguments to pass to dataset creation | Unchanged | -| --data-column-mapper JSON string of column mappings to apply to the dataset | Specify a data column mapper object, for example `--data-column-mapper ‘{“kind”:”generative_column_mapper”,”column_mappings”:{“text_column”:”instruction”}}` | -| --short-prompt-strategy [ignore, concatenate, pad, error] Strategy for handling prompts shorter than target length. [default: ignore] | Unchanged | -| --pad-char TEXT Character to pad short prompts with when using “pad” strategy (used with ‘concatenate’ strategy). | Unchanged | -| --concat-delimiter TEXT Delimiter for concatenating short prompts (used with ‘concatenate’ strategy). | Unchanged | -| --include-prefix-in-token-count Include prefix tokens in prompt token count calculation. 
| Unchanged | -| --push-to-hub Push the processed dataset to Hugging Face Hub. | Unchanged | -| --hub-dataset-id TEXT Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). | Unchanged | -| --random-seed INTEGER Random seed for reproducible token sampling. [default: 42] | Unchanged | +| v0.6.0 option | v0.7.0 equivalent | +| :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| data (positional parameter) | Use dataset descriptor, for example `kind=huggingface,source=` | +| output_path (positional parameter) | Results file path, for example `file.json` | +| --processor TEXT Processor or tokenizer name for calculating token counts. | Unchanged | +| --config TEXT PreprocessDatasetConfig as JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). Example: `prompt_tokens=100,output_tokens=50,prefix_tokens_max=10` or `{"prompt_tokens": 100, "output_tokens": 50, "prefix_tokens_max": 10}` [Mandatory] | Unchanged | +| --processor-args TEXT JSON string of arguments to pass to the processor constructor. | Unchanged | +| --data-args TEXT JSON string of arguments to pass to dataset creation | Unchanged | +| --data-column-mapper JSON string of column mappings to apply to the dataset | Specify a data column mapper object, for example `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` | +| --short-prompt-strategy [ignore, concatenate, pad, error] Strategy for handling prompts shorter than target length. 
[default: ignore] | Unchanged | +| --pad-char TEXT Character to pad short prompts with when using "pad" strategy (used with 'concatenate' strategy). | Unchanged | +| --concat-delimiter TEXT Delimiter for concatenating short prompts (used with 'concatenate' strategy). | Unchanged | +| --include-prefix-in-token-count Include prefix tokens in prompt token count calculation. | Unchanged | +| --push-to-hub Push the processed dataset to Hugging Face Hub. | Unchanged | +| --hub-dataset-id TEXT Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). | Unchanged | +| --random-seed INTEGER Random seed for reproducible token sampling. [default: 42] | Unchanged | From 8c417ac2d3b6081de1cbcb0c0c071d71c2c08182 Mon Sep 17 00:00:00 2001 From: David Butenhof Date: Fri, 26 Jun 2026 17:10:13 -0400 Subject: [PATCH 8/8] more Signed-off-by: David Butenhof --- docs/guides/v0.7.0_migration_guide.md | 104 +++++++++++--------------- 1 file changed, 45 insertions(+), 59 deletions(-) diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md index 6822c2634..241cc76d5 100644 --- a/docs/guides/v0.7.0_migration_guide.md +++ b/docs/guides/v0.7.0_migration_guide.md @@ -6,44 +6,44 @@ Run a benchmark against a generative model. 
This command is now `guidellm run` -| v0.6.0 option | v0.7.0 equivalent | -| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| --backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api_key": "apikey-\*", "verify": false}' | Options passed to `--backend`, like `--backend "kind=openai_http,api_key=sk…"` | -| --backend Backend type. Options: vllm_python, openai_http. | Merged with `--backend-kwargs` `--backend '{"kind": "openai_http", "extras": {"body": {"temperature": 0.6}}}'` | -| --cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,cooldown=2` for a two second cooldown or `--profile '{"kind":"concurrent","cooldown":{"mode":"duration","value":2}}` | -| --data-args JSON string of arguments to pass to dataset creation. | Specified with "load_kwargs" as part of data, e.g., `--data '{"kind":"huggingface","load_kwargs":{"split":"train"}}'` | -| --data-column-mapper JSON string of column mappings to apply to the dataset. 
E.g., '{"text_column": "article", "output_tokens_count_column" :"output_tokens"}'\` | Data column mappers have a "kind": `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}` | -| --data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}'\` | Use `--data-finalizer kind=generative` | -| --data-num-workers Number of worker processes for data loading. | Specified as part of Data Loader configuration with `--data-loader kind=pytorch,num_workers=3` | -| --data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | `--data-preprocessor '{"kind":"encode_media","audio_kwargs":{"format":"mp3"}}'` | -| --data-preprocessors List of preprocessors to apply to the dataset. E.g., 'encode_media,my_custom_preprocessor' | `--data-preprocessor kind=encode_media` … can be repeated to configure multiple preprocessors. | -| --data-sampler Data sampler type. | Shuffle function is under `--data-loader kind=pytorch,shuffle=true` | -| --data-samples Number of samples from dataset. -1 (default) uses all samples and dynamically generates more. | Specify as part of Data Loader configuration, as `--data-loader kind=pytorch,samples=10` | -| --data HuggingFace dataset ID, path to dataset, path to data file (csv/json/jsonl/txt), or synthetic data config (json/key=value). | `--data kind=huggingface,source=` `--data kind=csv_file,path=` `--data kind=synthetic_text,prompt_tokens=128,output_tokens=64` | -| --dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, as `--data-loader kind=pytorch,shuffle=true,samples=100` | -| --detect-saturation Enable over-saturation detection with default settings. | Enable oversaturation constraint, for example `--constraint kind=over_saturation` | -| --disable-console-interactive Disable interactive console progress updates. 
| Unchanged: `--disable-console-interactive` or `--disable-progress` | -| --disable-console Disable all outputs to the console (updates, interactive progress, results). | Unchanged: `--disable-console` or `--disable-console-outputs` | -| --max-error-rate Maximum error rate before stopping the benchmark. | Enable maximum error rate constraint, for example `--constraint kind=max_error_rate,rate=10` | -| --max-errors Maximum errors before stopping the benchmark. | Enable maximum error count constraint, for example `--constraint kind=max_errors,count=10` | -| --max-global-error-rate Maximum global error rate across all benchmarks. | Enable maximum global error rate constraint, for example `--constraint kind=max_global_error_rate,rate=10,minimum=100` | -| --max-requests Maximum requests per benchmark. If None, runs until max_seconds or data exhaustion. | Enable maximum requests constraint, for example `--constraint kind=max_requests,count=1000` | -| --max-seconds Maximum seconds per benchmark. If None, runs until max_requests or data exhaustion. | Enable maximum duration constraint, for example `--constraint kind=max_duration,seconds=60` | -| --model Model ID to benchmark. If not provided, uses first available model. | Specify a model name as part of the backend configuration, for example `--backend kind=openai_http,model=gpt4` | -| --output-dir or –-output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | -| --outputs The filename.ext for each of the outputs to create or the alises (json, csv, html) for the output files to create with their default file names (benchmark.[EXT]) | Specify multiple output formats by repeating the `--output` option, for example `--output kind=json,path=benchmark.json –output kind=csv,path=benchmark.csv` | -| --over-saturation Enable over-saturation detection. 
Pass a JSON dict with configuration (e.g., '{"enabled": true, "min_seconds": 30}'). Defaults to None (disabled). | Enable oversaturation constraint, for example `--constraint kind=over_saturation,mode=enforce,min_seconds=30` | -| --processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly to the tokenizer, for example `--tokenizer '{"kind":"huggingface_auto","load_kwargs":{"fast":true}}'` | -| --processor Processor or tokenizer for token count calculations. If not provided, loads from model. | Defaults to the default tokenizer for the first model supported by the backend target. To override, `--tokenizer kind=huggingface_auto,model=gpt4` | -| --profile Benchmark profile type. Options: sweep, async, poisson, synchronous, throughput, concurrent, constant. | Specify the benchmark profile to use, for example, `--profile kind=sweep,sweep_size=10,warmup=1,cooldown=1` | -| --rampup The time, in seconds, to ramp up the request rate over. Applicable for Throughput, Concurrent, and Constant strategies | Specify as part of profile, for example `--profile kind=constant,rate=10,rampup_duration=2` | -| --random-seed Random seed for reproducibility. | Specify the random seed configuration like `--seed kind=static,value=42` | -| --rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | "Rate" was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | -| --request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. 
Default: default-templateFor openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | -| --sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20. | Specify as part of the metrics configuration, for example `--metrics kind=generative,sample_size=20` | -| --scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | -| --target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). | Specify as part of backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | -| --warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. 
| Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile '{"kind":"concurrent","warmup":{"mode":"duration","value":2}}'` | +| v0.6.0 option | v0.7.0 equivalent | +| :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| --backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api_key": "apikey-\*", "verify": false}' | Options passed to `--backend` constructor, for example `--backend "kind=openai_http,api_key=sk…"` | +| --backend Backend type. Options: vllm_python, openai_http. | The "kind" of the backend specification, for example `--backend '{"kind": "openai_http", "extras": {"body": {"temperature": 0.6}}}'` | +| --cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with the `cooldown` profile attribute, for example `--profile kind=synchronous,cooldown=2` for a two second cooldown or `--profile '{"kind":"concurrent","cooldown":{"mode":"duration","value":2}}` | +| --data-args JSON string of arguments to pass to dataset creation. 
| Specified with the "load_kwargs" data attribute, e.g., `--data '{"kind":"huggingface","load_kwargs":{"split":"train"}}'` | +| --data-column-mapper JSON string of column mappings to apply to the dataset. E.g., '{"text_column": "article", "output_tokens_count_column": "output_tokens"}' | Specify the kind and attributes of a data column mapper, for example `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` | +| --data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}' | Specify the kind of finalizer, for example `--data-finalizer kind=generative` | +| --data-num-workers Number of worker processes for data loading. | Specified with the Data Loader `num_workers` attribute, for example `--data-loader kind=pytorch,num_workers=3` | +| --data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | Add parameters to the data preprocessor constructor, for example `--data-preprocessor '{"kind":"encode_media","audio_kwargs":{"format":"mp3"}}'` | +| --data-preprocessors List of preprocessors to apply to the dataset. E.g., 'encode_media,my_custom_preprocessor' | Specify the preprocessor kind and attributes, for example `--data-preprocessor kind=encode_media`; repeat the option to configure multiple preprocessors. | +| --data-sampler Data sampler type. | Shuffle function is a data loader attribute, for example `--data-loader kind=pytorch,shuffle=true` | +| --data-samples Number of samples from dataset. -1 (default) uses all samples and dynamically generates more. | Specify as part of Data Loader configuration, for example `--data-loader kind=pytorch,samples=10` | +| --data HuggingFace dataset ID, path to dataset, path to data file (csv/json/jsonl/txt), or synthetic data config (json/key=value).
| Specify the kind of dataset together with attributes, for example `--data kind=huggingface,source=` `--data kind=csv_file,path=` `--data kind=synthetic_text,prompt_tokens=128,output_tokens=64` | +| --dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, for example `--data-loader kind=pytorch,shuffle=true,samples=100` | +| --detect-saturation Enable over-saturation detection with default settings. | Specify oversaturation constraint kind and attributes, for example `--constraint kind=over_saturation` | +| --disable-console-interactive Disable interactive console progress updates. | Unchanged: `--disable-console-interactive` or `--disable-progress` | +| --disable-console Disable all outputs to the console (updates, interactive progress, results). | Unchanged: `--disable-console` or `--disable-console-outputs` | +| --max-error-rate Maximum error rate before stopping the benchmark. | Specify maximum error rate constraint kind and attributes, for example `--constraint kind=max_error_rate,rate=10` | +| --max-errors Maximum errors before stopping the benchmark. | Specify maximum error count constraint kind and attributes, for example `--constraint kind=max_errors,count=10` | +| --max-global-error-rate Maximum global error rate across all benchmarks. | Specify maximum global error rate constraint kind and attributes, for example `--constraint kind=max_global_error_rate,rate=10,minimum=100` | +| --max-requests Maximum requests per benchmark. If None, runs until max_seconds or data exhaustion. | Specify maximum requests constraint kind and attributes, for example `--constraint kind=max_requests,count=1000` | +| --max-seconds Maximum seconds per benchmark. If None, runs until max_requests or data exhaustion. | Specify maximum duration constraint kind and attributes, for example `--constraint kind=max_duration,seconds=60` | +| --model Model ID to benchmark. If not provided, uses first available model. 
| Specify with the `model` attribute of the backend configuration, for example `--backend kind=openai_http,model=gpt4` | +| --output-dir or --output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | +| --outputs The filename.ext for each of the outputs to create or the aliases (json, csv, html) for the output files to create with their default file names (benchmark.[EXT]) | Specify multiple output formats by repeating the `--output` option, for example `--output kind=json,path=benchmark.json --output kind=csv,path=benchmark.csv` | +| --over-saturation Enable over-saturation detection. Pass a JSON dict with configuration (e.g., '{"enabled": true, "min_seconds": 30}'). Defaults to None (disabled). | Specify oversaturation constraint kind and attributes, for example `--constraint kind=over_saturation,mode=enforce,min_seconds=30` | +| --processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly along with the tokenizer kind, for example `--tokenizer '{"kind":"huggingface_auto","load_kwargs":{"fast":true}}'` | +| --processor Processor or tokenizer for token count calculations. If not provided, loads from model. | Defaults to the default tokenizer for the first model supported by the backend target. To override, specify the tokenizer kind and attributes, for example `--tokenizer kind=huggingface_auto,model=gpt4` | +| --profile Benchmark profile type. Options: sweep, async, poisson, synchronous, throughput, concurrent, constant. | Specify the benchmark profile kind and attributes to use, for example, `--profile kind=sweep,sweep_size=10,warmup=1,cooldown=1` | +| --rampup The time, in seconds, to ramp up the request rate over.
Applicable for Throughput, Concurrent, and Constant strategies | Specify with the `rampup_duration` profile attribute, for example `--profile kind=constant,rate=10,rampup_duration=2` | +| --random-seed Random seed for reproducibility. | Specify the random seed configuration kind and attributes, for example `--seed kind=static,value=42` | +| --rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | "Rate" was overloaded to specify the primary configuration for each profile type. Specify the appropriate attribute with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | +| --request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. Default: default-template. For openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | +| --sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20. | Specify with the `sample_size` attribute of the metrics configuration, for example `--metrics kind=generative,sample_size=20`. The default if unspecified is to save all samples. | +| --scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | +| --target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). | Specify with the `target` attribute of the backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | +| --warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema.
| Specify with the `warmup` profile attribute, for example `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile '{"kind":"concurrent","warmup":{"mode":"duration","value":2}}'` | | **NEW OPTIONS** | **v0.7.0 new options** | | :----------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -52,7 +52,10 @@ This command is now `guidellm run` ## `Guidellm benchmark from-file` -Load a saved benchmark report and optionally re-export data +Load a saved benchmark report and optionally re-export data. + +> [!WARNING]\ +> This command may be changed to be more consistent with the `run` command in the future. | Option | v0.7.0 equivalent | | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------- | @@ -68,30 +71,13 @@ Changed from `guidellm config` to `guidellm env` to clarify that it displays env `guidellm config` will be used later for a different purpose, to generate YAML config files from `run` options. -## `guidellm mock-server` - -Start a mock OpenAI/vLLM-compatible server for testing. **[NO CHANGE]** - -| v0.6.0 option | v0.7.0 equivalent | -| :----------------------------------------------------------------------------------------------- | :---------------- | -| --host TEXT Host address to bind the server to. | Unchanged | -| --port INTEGER Port number to bind the server to. | Unchanged | -| --workers INTEGER Number of worker processes. | Unchanged | -| --model TEXT Name of the model to mock. | Unchanged | -| --processor TEXT Processor or tokenizer to use for requests. | Unchanged | -| --request-latency FLOAT Request latency in seconds for non-streaming requests. 
| Unchanged | -| --request-latency-std FLOAT Request latency standard deviation in seconds (normal distribution). | Unchanged | -| --ttft-ms FLOAT Time to first token in milliseconds for streaming requests. | Unchanged | -| --ttft-ms-std FLOAT Time to first token standard deviation in milliseconds. | Unchanged | -| --itl-ms FLOAT Inter-token latency in milliseconds for streaming requests. | Unchanged | -| --itl-ms-std FLOAT Inter-token latency standard deviation in milliseconds. | Unchanged | -| --output-tokens INTEGER Number of output tokens for streaming requests. | Unchanged | -| --output-tokens-std FLOAT Output tokens standard deviation (normal distribution). | Unchanged | - ## `guidellm preprocess dataset` Tools for preprocessing datasets for use in benchmarks. +> [!WARNING]\ +> This command may be changed to be more consistent with the `run` command in the future. + | v0.6.0 option | v0.7.0 equivalent | | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------ | | data (positional parameter) | Use dataset descriptor, for example `kind=huggingface,source=` |