diff --git a/docs/guides/backends.md b/docs/guides/backends.md index 69bfe15e5..e998dce92 100644 --- a/docs/guides/backends.md +++ b/docs/guides/backends.md @@ -147,4 +147,4 @@ guidellm run \ ## Expanding Backend Support -GuideLLM is an open platform, and we encourage contributions to extend its backend support. Whether it's adding new server implementations, integrating with Python-based backends, or enhancing existing capabilities, your contributions are welcome. For more details on how to contribute, see the [CONTRIBUTING.md](../../CONTRIBUTING.md) file. +GuideLLM is an open platform, and we encourage contributions to extend its backend support. Whether it's adding new server implementations, integrating with Python-based backends, or enhancing existing capabilities, your contributions are welcome. For more details on how to contribute, see the [CONTRIBUTING.md](https://github.com/vllm-project/guidellm/blob/main/CONTRIBUTING.md) file. diff --git a/docs/guides/datasets.md b/docs/guides/datasets.md index c7f0c8157..23be265c7 100644 --- a/docs/guides/datasets.md +++ b/docs/guides/datasets.md @@ -333,7 +333,7 @@ guidellm preprocess dataset \ | Argument | Description | | ------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | -| `DATA` | Identify the dataset to process. Supports all dataset formats documented in the [Dataset Configurations](../datasets.md). | +| `DATA` | Identify the dataset to process. Supports all dataset formats documented in the [Dataset Configurations](#datasets). | | `OUTPUT_PATH` | Path to save the processed dataset, including file suffix (e.g., `processed_dataset.jsonl`, `output.csv`). | | `--processor` | **Required.** Processor or tokenizer name/path for calculating token counts. Can be a Hugging Face model ID or local path. | | `--config` | **Required.** Configuration specifying target token sizes. 
Can be a JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). | diff --git a/docs/guides/embeddings.md b/docs/guides/embeddings.md index d37cc80bb..2558c435d 100644 --- a/docs/guides/embeddings.md +++ b/docs/guides/embeddings.md @@ -37,6 +37,6 @@ guidellm run \ ## See Also -- [Benchmark Profiles](benchmark-profiles.md) - Detailed explanation of all profile types +- [Benchmark Profiles](../getting-started/benchmark.md#benchmark-profiles---profile) - Detailed explanation of all profile types - [Datasets Guide](datasets.md) - Creating and using custom datasets - [Metrics Guide](metrics.md) - Understanding performance metrics diff --git a/docs/guides/v0.7.0_migration_guide.md b/docs/guides/v0.7.0_migration_guide.md new file mode 100644 index 000000000..6822c2634 --- /dev/null +++ b/docs/guides/v0.7.0_migration_guide.md @@ -0,0 +1,110 @@ +# CLI Migration Guide + +## `guidellm benchmark [run]` + +Run a benchmark against a generative model. + +This command is now `guidellm run` + +| v0.6.0 option | v0.7.0 equivalent | +| :---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| --backend-kwargs JSON string of arguments to pass to the backend. E.g., '{"api_key": "apikey-\*", "verify": false}' | Options passed to `--backend`, like `--backend "kind=openai_http,api_key=sk…"` | +| --backend Backend type. 
Options: vllm_python, openai_http. | Merged with `--backend-kwargs` `--backend '{"kind": "openai_http", "extras": {"body": {"temperature": 0.6}}}'` | +| --cooldown Cooldown specification: int, float, or dict as string (json or key=value). Controls time or requests after measurement ends. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,cooldown=2` for a two second cooldown or `--profile '{"kind":"concurrent","cooldown":{"mode":"duration","value":2}}` | +| --data-args JSON string of arguments to pass to dataset creation. | Specified with "load_kwargs" as part of data, e.g., `--data '{"kind":"huggingface","load_kwargs":{"split":"train"}}'` | +| --data-column-mapper JSON string of column mappings to apply to the dataset. E.g., '{"text_column": "article", "output_tokens_count_column" :"output_tokens"}'\` | Data column mappers have a "kind": `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}` | +| --data-finalizer JSON string of finalizer to convert dataset rows to requests. E.g., 'generative' or '{"type": "generative"}'\` | Use `--data-finalizer kind=generative` | +| --data-num-workers Number of worker processes for data loading. | Specified as part of Data Loader configuration with `--data-loader kind=pytorch,num_workers=3` | +| --data-preprocessors-kwargs JSON string of arguments to pass to all preprocessors. | `--data-preprocessor '{"kind":"encode_media","audio_kwargs":{"format":"mp3"}}'` | +| --data-preprocessors List of preprocessors to apply to the dataset. E.g., 'encode_media,my_custom_preprocessor' | `--data-preprocessor kind=encode_media` … can be repeated to configure multiple preprocessors. | +| --data-sampler Data sampler type. 
| Shuffle function is under `--data-loader kind=pytorch,shuffle=true` | +| --data-samples Number of samples from dataset. -1 (default) uses all samples and dynamically generates more. | Specify as part of Data Loader configuration, as `--data-loader kind=pytorch,samples=10` | +| --data HuggingFace dataset ID, path to dataset, path to data file (csv/json/jsonl/txt), or synthetic data config (json/key=value). | `--data kind=huggingface,source=` `--data kind=csv_file,path=` `--data kind=synthetic_text,prompt_tokens=128,output_tokens=64` | +| --dataloader-kwargs JSON string of arguments to pass to the dataloader constructor. | Passed directly to Data Loader, as `--data-loader kind=pytorch,shuffle=true,samples=100` | +| --detect-saturation Enable over-saturation detection with default settings. | Enable oversaturation constraint, for example `--constraint kind=over_saturation` | +| --disable-console-interactive Disable interactive console progress updates. | Unchanged: `--disable-console-interactive` or `--disable-progress` | +| --disable-console Disable all outputs to the console (updates, interactive progress, results). | Unchanged: `--disable-console` or `--disable-console-outputs` | +| --max-error-rate Maximum error rate before stopping the benchmark. | Enable maximum error rate constraint, for example `--constraint kind=max_error_rate,rate=10` | +| --max-errors Maximum errors before stopping the benchmark. | Enable maximum error count constraint, for example `--constraint kind=max_errors,count=10` | +| --max-global-error-rate Maximum global error rate across all benchmarks. | Enable maximum global error rate constraint, for example `--constraint kind=max_global_error_rate,rate=10,minimum=100` | +| --max-requests Maximum requests per benchmark. If None, runs until max_seconds or data exhaustion. | Enable maximum requests constraint, for example `--constraint kind=max_requests,count=1000` | +| --max-seconds Maximum seconds per benchmark. 
If None, runs until max_requests or data exhaustion. | Enable maximum duration constraint, for example `--constraint kind=max_duration,seconds=60` | +| --model Model ID to benchmark. If not provided, uses first available model. | Specify a model name as part of the backend configuration, for example `--backend kind=openai_http,model=gpt4` | +| --output-dir or –-output-path: The directory path to save file output types in | Specify paths as part of the individual output configurations, for example `--output kind=json,path=/tmp/reports/benchmark.json` | +| --outputs The filename.ext for each of the outputs to create or the alises (json, csv, html) for the output files to create with their default file names (benchmark.[EXT]) | Specify multiple output formats by repeating the `--output` option, for example `--output kind=json,path=benchmark.json –output kind=csv,path=benchmark.csv` | +| --over-saturation Enable over-saturation detection. Pass a JSON dict with configuration (e.g., '{"enabled": true, "min_seconds": 30}'). Defaults to None (disabled). | Enable oversaturation constraint, for example `--constraint kind=over_saturation,mode=enforce,min_seconds=30` | +| --processor-args JSON string of arguments to pass to the processor constructor. | Specify options directly to the tokenizer, for example `--tokenizer '{"kind":"huggingface_auto","load_kwargs":{"fast":true}}'` | +| --processor Processor or tokenizer for token count calculations. If not provided, loads from model. | Defaults to the default tokenizer for the first model supported by the backend target. To override, `--tokenizer kind=huggingface_auto,model=gpt4` | +| --profile Benchmark profile type. Options: sweep, async, poisson, synchronous, throughput, concurrent, constant. | Specify the benchmark profile to use, for example, `--profile kind=sweep,sweep_size=10,warmup=1,cooldown=1` | +| --rampup The time, in seconds, to ramp up the request rate over. 
Applicable for Throughput, Concurrent, and Constant strategies | Specify as part of profile, for example `--profile kind=constant,rate=10,rampup_duration=2` | +| --random-seed Random seed for reproducibility. | Specify the random seed configuration like `--seed kind=static,value=42` | +| --rate Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second. | "Rate" was overloaded to specify the primary configuration for each profile type. Specify with `--profile` or `--override profile.`: async/constant/poisson → `rate`, concurrent → `streams`, sweep → `sweep_size`, throughput → `max_concurrency`. | +| --request-format Format to use for requests. Options depend on backend. For vLLM backend: plain (no chat template, text appending only), default-template (use tokenizer default), or a file path / single-line template per vLLM docs. Default: default-templateFor openai backend: http endpoint path (/v1/chat/completions, /v1/completions, /v1/audio/transcriptions, /v1/audio/translations) or alias (e.g. chat_completions); default /v1/chat/completions. | Specify as part of backend configuration, like `--backend kind=openai_http,request_format=/v1/responses` | +| --sample-requests Number of sample requests per status to save. None (default) saves all, recommended: 20. | Specify as part of the metrics configuration, for example `--metrics kind=generative,sample_size=20` | +| --scenario Builtin scenario name or path to config file. CLI options override scenario settings. | The preferred name is now `--config`, although both `--scenario` and `-c` are aliases, for example `--config chat` or `--config my-scenario.yaml`. | +| --target Target backend URL (e.g., [http://localhost:8000](http://localhost:8000)). 
| Specify as part of backend configuration, for example `--backend kind=openai_http,target=http://localhost:8000` | +| --warmup Warmup specification: int, float, or dict as string (json or key=value). Controls time or requests before measurement starts. Numeric in (0, 1): percent of duration or request count. Numeric >=1: duration in seconds or request count. Advanced config: see TransientPhaseConfig schema. | Specify with profile, e.g., `--profile kind=synchronous,warmup=2` for a two second warmup or `--profile '{"kind":"concurrent","warmup":{"mode":"duration","value":2}}'` | + +| **NEW OPTIONS** | **v0.7.0 new options** | +| :----------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| Add metadata to output reports | Specify key-value pairs of metadata labels which will be written to the output reports, for example `--label gpu=NVIDIA_Z500 --label creator=Intrepid_Adventurer` | +| Override concurrent profile stream count and async/constant/poisson rate | Specify profile settings to override the default profile settings, for example `--override profile.rate 10,20,30` or `--override profile.streams 10,20,30` | + +## `Guidellm benchmark from-file` + +Load a saved benchmark report and optionally re-export data + +| Option | v0.7.0 equivalent | +| :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------- | +| PATH Path to the saved benchmark report file (default: ./benchmarks. | Unchanged | +| --output-path Directory or file path to save re-exported benchmark results. If a directory, all output formats will be saved there. If a file, the matching format will be saved to that file. 
| Unchanged | +| --output-formats Output formats for benchmark results (e.g., console, json, html, csv). | Unchanged | + +## `guidellm config` + +Show configuration settings + +Changed from `guidellm config` to `guidellm env` to clarify that it displays environment variables affecting GuideLLM operation. + +`guidellm config` will be used later for a different purpose, to generate YAML config files from `run` options. + +## `guidellm mock-server` + +Start a mock OpenAI/vLLM-compatible server for testing. **[NO CHANGE]** + +| v0.6.0 option | v0.7.0 equivalent | +| :----------------------------------------------------------------------------------------------- | :---------------- | +| --host TEXT Host address to bind the server to. | Unchanged | +| --port INTEGER Port number to bind the server to. | Unchanged | +| --workers INTEGER Number of worker processes. | Unchanged | +| --model TEXT Name of the model to mock. | Unchanged | +| --processor TEXT Processor or tokenizer to use for requests. | Unchanged | +| --request-latency FLOAT Request latency in seconds for non-streaming requests. | Unchanged | +| --request-latency-std FLOAT Request latency standard deviation in seconds (normal distribution). | Unchanged | +| --ttft-ms FLOAT Time to first token in milliseconds for streaming requests. | Unchanged | +| --ttft-ms-std FLOAT Time to first token standard deviation in milliseconds. | Unchanged | +| --itl-ms FLOAT Inter-token latency in milliseconds for streaming requests. | Unchanged | +| --itl-ms-std FLOAT Inter-token latency standard deviation in milliseconds. | Unchanged | +| --output-tokens INTEGER Number of output tokens for streaming requests. | Unchanged | +| --output-tokens-std FLOAT Output tokens standard deviation (normal distribution). | Unchanged | + +## `guidellm preprocess dataset` + +Tools for preprocessing datasets for use in benchmarks. 
+
+| v0.6.0 option | v0.7.0 equivalent |
+| :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| data (positional parameter) | Use dataset descriptor, for example `kind=huggingface,source=` |
+| output_path (positional parameter) | Results file path, for example `file.json` |
+| --processor TEXT Processor or tokenizer name for calculating token counts. | Unchanged |
+| --config TEXT PreprocessDatasetConfig as JSON string, key=value pairs, or file path (.json, .yaml, .yml, .config). Example: `prompt_tokens=100,output_tokens=50,prefix_tokens_max=10` or `{"prompt_tokens": 100, "output_tokens": 50, "prefix_tokens_max": 10}` [Mandatory] | Unchanged |
+| --processor-args TEXT JSON string of arguments to pass to the processor constructor. | Unchanged |
+| --data-args TEXT JSON string of arguments to pass to dataset creation | Unchanged |
+| --data-column-mapper JSON string of column mappings to apply to the dataset | Specify a data column mapper object, for example `--data-column-mapper '{"kind":"generative_column_mapper","column_mappings":{"text_column":"instruction"}}'` |
+| --short-prompt-strategy [ignore, concatenate, pad, error] Strategy for handling prompts shorter than target length. [default: ignore] | Unchanged |
+| --pad-char TEXT Character to pad short prompts with when using "pad" strategy (used with 'concatenate' strategy). | Unchanged |
+| --concat-delimiter TEXT Delimiter for concatenating short prompts (used with 'concatenate' strategy). | Unchanged |
+| --include-prefix-in-token-count Include prefix tokens in prompt token count calculation. | Unchanged |
+| --push-to-hub Push the processed dataset to Hugging Face Hub. | Unchanged |
+| --hub-dataset-id TEXT Hugging Face Hub dataset ID for upload (required if `--push-to-hub` is set). | Unchanged |
+| --random-seed INTEGER Random seed for reproducible token sampling. [default: 42] | Unchanged |
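
The v0.7.0 column of the migration guide repeatedly uses one descriptor pattern: a `kind=` entry followed by comma-separated `key=value` options. As a rough illustration only, the Python sketch below assembles a `guidellm run` command line in that style. The helper names `descriptor` and `build_run_args` are hypothetical and not part of GuideLLM; the option names and keys are copied from the table rows in the diff, and how the CLI actually parses or escapes these strings is an assumption that has not been verified here.

```python
import shlex


def descriptor(kind, **options):
    """Render a descriptor string such as 'kind=constant,rate=10'.

    Hypothetical helper: it only joins keys and values with '=' and ','.
    It does no escaping, so values containing commas would need the JSON form.
    """
    parts = [f"kind={kind}"]
    parts += [f"{key}={value}" for key, value in options.items()]
    return ",".join(parts)


def build_run_args(backend, profile, constraints=(), outputs=()):
    """Build an argv list for `guidellm run` from descriptor strings.

    Hypothetical helper: repeats --constraint and --output once per entry,
    mirroring the "repeat the option" wording in the migration table.
    """
    args = ["guidellm", "run", "--backend", backend, "--profile", profile]
    for constraint in constraints:
        args += ["--constraint", constraint]
    for output in outputs:
        args += ["--output", output]
    return args


args = build_run_args(
    backend=descriptor("openai_http", target="http://localhost:8000"),
    profile=descriptor("constant", rate=10),
    constraints=[descriptor("max_duration", seconds=60)],
    outputs=[descriptor("json", path="benchmark.json")],
)

# Print a shell-ready command line; nothing is executed.
print(shlex.join(args))
```

Under these assumptions, the printed command corresponds to a v0.6.0 invocation that used `--target`, `--profile`, `--rate`, `--max-seconds`, and `--outputs`, with each of those now folded into a `--backend`, `--profile`, `--constraint`, or `--output` descriptor.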