Add input-embedding quantization example #2830
Adds examples/quantization_embedding showing how to quantize a model's input embedding table to weight-only intN (WNA16) with a data-free QuantizationModifier. The recipe targets the `Embedding` module by class name (portable across architectures, independent of module prefix). Embedding quantization is near-lossless and most useful for large-vocabulary models. Verified the flow end-to-end (load -> oneshot -> generate -> save); accuracy table from pythia-1.4b included in the README. The resulting checkpoint loads and runs in vLLM. Co-authored-by: Claude Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>
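To illustrate why a class-name target is portable, here is a minimal sketch that compares class-name matching with a name-based pattern such as the `re:.*embed_tokens$` target the PR description mentions. The `Embedding` and `Linear` classes are stand-ins, the module dictionaries are illustrative toy trees, and `match_by_class` / `match_by_regex` are hypothetical helpers. This is not llm-compressor's matching code, and it does not model vLLM's `prefix` handling.

```python
import re


class Embedding:
    """Stand-in for an embedding layer class (assumption, not torch.nn)."""


class Linear:
    """Stand-in for a linear layer class (assumption, not torch.nn)."""


# Two toy module trees whose embedding layers have different names.
model_a = {
    "model.embed_tokens": Embedding(),
    "model.layers.0.mlp.up_proj": Linear(),
    "lm_head": Linear(),
}
model_b = {
    "gpt_neox.embed_in": Embedding(),
    "gpt_neox.layers.0.mlp.dense_h_to_4h": Linear(),
    "embed_out": Linear(),
}


def match_by_class(modules, class_name):
    """Select modules whose class name equals class_name."""
    return [name for name, mod in modules.items() if type(mod).__name__ == class_name]


def match_by_regex(modules, pattern):
    """Select modules whose full name matches a regular expression."""
    return [name for name in modules if re.match(pattern, name)]


for label, modules in (("model_a", model_a), ("model_b", model_b)):
    print(label, "by class:", match_by_class(modules, "Embedding"))
    print(label, "by regex:", match_by_regex(modules, r".*embed_tokens$"))
```

In this toy setup the class-name target finds the embedding in both trees, while the name pattern only matches the tree that happens to use `embed_tokens`.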
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Walkthrough
This PR adds a new embedding quantization example directory with a runnable Llama3 8B Instruct script and comprehensive documentation. The example demonstrates data-free int4 weight quantization of input embeddings using QuantizationModifier and includes sample accuracy results and composition patterns.
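As a rough illustration of what "data-free int4 weight quantization" means, the sketch below applies symmetric round-to-nearest quantization to one embedding row, with one scale per group of weights. It uses only the weights themselves, so no calibration data is involved. The helper names `quantize_row` and `dequantize_row` are hypothetical, the group size of 64 follows the PR description, and the symmetric scheme is an assumption. This is not the code path `QuantizationModifier` runs.

```python
import random


def quantize_row(row, group_size=64, num_bits=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for int4
    qmin = -(2 ** (num_bits - 1))    # -8 for int4
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(x) for x in group)
        # Map the largest magnitude in the group onto qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(x / scale))) for x in group)
    return codes, scales


def dequantize_row(codes, scales, group_size=64):
    """Recover approximate float values from integer codes and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy embedding row
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups={len(scales)} max_abs_error={max_err:.6f}")
```

With round-to-nearest and no clipping, the per-element error is bounded by half of that group's scale, which is why smaller groups generally track the original weights more closely at the cost of storing more scales.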
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviews
This rule is failing. PRs labelled "two-reviews" must have at least two approving reviews before merging.
Code Review
This pull request introduces a new example and documentation for quantizing a model's input embedding table using llm-compressor. The feedback suggests adding device_map="auto" when loading the model in both the example script and the README to prevent potential CPU Out-Of-Memory (OOM) issues. Additionally, it is recommended to replace the invalid Python syntax in the README's composition example with a valid placeholder to ensure copy-pasted code remains functional.
# Select model and load it.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
Specifying device_map='auto' is highly recommended when loading large models like Llama-3-8B. Without it, the model is loaded entirely into CPU RAM first, which can easily trigger Out-Of-Memory (OOM) crashes on standard GPU instances with limited system memory.
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/quantization_embedding/README.md`:
- Around line 24-26: Update the quickstart command in the README so it runs from
the repository root: replace the current line "python3 llama3_example.py" with
either an explicit path "python3
examples/quantization_embedding/llama3_example.py" or add a preceding "cd
examples/quantization_embedding && python3 llama3_example.py" step so the
example is runnable as documented.
- Around line 38-40: Add a short note to the README next to the model usage
(referencing model_id and the AutoModelForCausalLM.from_pretrained /
AutoTokenizer.from_pretrained calls) stating that
"meta-llama/Meta-Llama-3-8B-Instruct" is gated on Hugging Face and that users
must have account access and authenticate locally (for example via
huggingface-cli login or setting HF_TOKEN) before running the example; keep it
concise and placed immediately after the example snippet.
📒 Files selected for processing (2)
- examples/quantization_embedding/README.md
- examples/quantization_embedding/llama3_example.py
```bash
python3 llama3_example.py
```
Fix quickstart command path to make the example runnable from repo root.
python3 llama3_example.py is not correct from the repository root implied by the installation steps; use an explicit path or add a cd step.
Proposed doc fix
## Quickstart
```bash
-python3 llama3_example.py
+python3 examples/quantization_embedding/llama3_example.py
```
As per coding guidelines, `**/README.md`: “Ensure installation instructions and examples are correct.”
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
Document the gated-model access prerequisite for Llama 3.
The walkthrough uses meta-llama/Meta-Llama-3-8B-Instruct; readers need an explicit note about required Hugging Face access/token or quickstart can fail unexpectedly.
Proposed doc fix
### 1) Load the model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
+> Note: This model is gated on Hugging Face. Ensure your account has access and
+> authenticate locally (for example via huggingface-cli login) before running.

As per coding guidelines, `**/README.md`: “Review for clarity, accuracy, and up-to-date information.”
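One way to surface the gated-model prerequisite early is a small preflight check before loading. The sketch below only inspects the `HF_TOKEN` environment variable named in the review comment; `hf_token_from_env` is a hypothetical helper. It will not detect a token stored by `huggingface-cli login`, so treat a missing variable as a hint rather than a definitive failure.

```python
import os


def hf_token_from_env(env=None):
    """Return the HF_TOKEN value from the environment, or None if unset/blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if hf_token_from_env() is None:
    # Assumption: the user may still be authenticated through a cached login.
    print(
        "HF_TOKEN is not set. If you have not run `huggingface-cli login`, "
        "loading a gated model is likely to fail."
    )
```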
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>
Purpose
Adds `examples/quantization_embedding/`, an example showing how to quantize a model's input embedding table to weight-only `intN` (WNA16) with a data-free `QuantizationModifier`. vLLM loads these checkpoints (embedding-quant support added in vllm-project/vllm#44340) and runs a fused gather + dequant over the looked-up rows, so the packed table is never densified.

The recipe targets the `Embedding` module by class name (`["Embedding"]`), which is portable across architectures and independent of a model's module prefix. This matters because name-based targets (e.g. `re:.*embed_tokens$`) require the model to forward `prefix` to its `VocabParallelEmbedding`, which not all vLLM models do (see vllm-project/vllm#45535).

Embedding quantization is weight-only and data-free (no calibration set), near-lossless, and most useful for large-vocabulary models where the embedding table is a meaningful fraction of memory.
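To make the memory argument concrete, here is a back-of-the-envelope estimate of embedding-table storage before and after quantization. It uses the vocabulary size (128,256) and hidden size (4,096) of Llama 3 8B. It assumes 16-bit dense weights, symmetric int4 codes with no zero points, and one 16-bit scale per group of 64 weights. `embedding_bytes` is a hypothetical helper, and real checkpoints may carry extra metadata.

```python
def embedding_bytes(vocab_size, hidden_size, bits_per_weight,
                    group_size=None, scale_bytes=2):
    """Approximate storage of an embedding table in bytes."""
    n_weights = vocab_size * hidden_size
    weight_bytes = n_weights * bits_per_weight // 8
    if group_size is None:
        return weight_bytes
    # One scale per group of `group_size` weights along the hidden dimension.
    n_scales = vocab_size * (hidden_size // group_size)
    return weight_bytes + n_scales * scale_bytes


vocab, hidden = 128_256, 4_096  # Llama 3 8B configuration
dense = embedding_bytes(vocab, hidden, bits_per_weight=16)
packed = embedding_bytes(vocab, hidden, bits_per_weight=4, group_size=64)

mib = 2 ** 20
print(f"16-bit table: {dense / mib:.0f} MiB")
print(f"int4, group 64: {packed / mib:.0f} MiB")
print(f"reduction: {dense / packed:.2f}x")
```

Under these assumptions the per-weight cost falls from 2 bytes to 0.53125 bytes (0.5 for the code plus 2/64 for the shared scale), and the saving grows linearly with vocabulary size.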
Changes
- `examples/quantization_embedding/llama3_example.py`: data-free embedding quant (int4, group size 64), sample generation, compressed save.
- `examples/quantization_embedding/README.md`: walkthrough, channel / 8-bit variants, how to compose with linear-weight quantization, and an accuracy table.

Testing
Ran the example flow end-to-end (load → `oneshot` → `dispatch_model` → generate → save); exits clean. Accuracy (lm-eval) on `pythia-1.4b` shows embedding quantization is near-lossless (see the accuracy table in the README).

The example references `meta-llama/Meta-Llama-3-8B-Instruct` per the repo convention (each scheme folder has a llama3 example); the identical recipe was validated on `pythia-1.4b` and `Mistral-7B-v0.1`, and the resulting checkpoints load and generate in vLLM.

This change was developed with AI assistance (Claude Code). All changed lines were reviewed by the submitter.
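The Purpose section notes that vLLM runs a fused gather + dequant over the looked-up rows, so the packed table is never densified. The toy sketch below shows the idea in plain Python: only the rows selected by the token IDs are unpacked and scaled. The nibble layout (two signed int4 codes per byte, low nibble first) and the helper names are assumptions for illustration. This is not vLLM's kernel or the compressed-tensors storage format.

```python
def pack_int4(codes):
    """Pack signed int4 codes (-8..7) two per byte, low nibble first."""
    assert len(codes) % 2 == 0
    out = bytearray()
    for lo, hi in zip(codes[0::2], codes[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover signed codes from packed bytes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes


def gather_dequant(packed_rows, row_scales, token_ids, group_size):
    """Dequantize only the rows selected by token_ids."""
    vectors = []
    for token_id in token_ids:
        codes = unpack_int4(packed_rows[token_id])
        scales = row_scales[token_id]
        vectors.append([q * scales[i // group_size] for i, q in enumerate(codes)])
    return vectors


# Tiny table: 4 tokens, hidden size 4, group size 2 (illustrative values).
table_codes = [[-8, 7, 0, 1], [1, 2, 3, 4], [-1, -2, -3, -4], [0, 0, 0, 0]]
row_scales = [[0.5, 0.25], [1.0, 1.0], [0.1, 0.2], [1.0, 1.0]]
packed_rows = [pack_int4(row) for row in table_codes]

vectors = gather_dequant(packed_rows, row_scales, token_ids=[2, 0], group_size=2)
print(vectors)
```

Rows 1 and 3 are never unpacked in this lookup, which is the property that keeps the full-precision table from being materialized.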