Add input-embedding quantization example by KKothuri · Pull Request #2830 · vllm-project/llm-compressor

KKothuri · 2026-06-13T17:01:02Z

Purpose

Adds examples/quantization_embedding/ — an example showing how to quantize a model's input embedding table to weight-only intN (WNA16) with a data-free QuantizationModifier. vLLM loads these checkpoints (embedding-quant support added in vllm-project/vllm#44340) and runs a fused gather + dequant over the looked-up rows, so the packed table is never densified.
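To make the "gather + dequant over the looked-up rows" idea concrete, here is a minimal pure-Python sketch. It is not vLLM's kernel and not the compressed-tensors storage format: the helper names (`quantize_row`, `gather_dequant`), the tiny table, and the group size of 4 are all assumptions chosen for illustration, and the int4 codes are kept as plain Python ints rather than packed two per byte.

```python
GROUP_SIZE = 4  # demo value only; the PR's example uses group size 64


def quantize_row(row, group_size=GROUP_SIZE, num_bits=4):
    """Symmetric group-wise quantization of one embedding row.

    Returns (codes, scales): one signed integer code per element and
    one floating-point scale per group of `group_size` elements.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 7 for int4
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-qmax - 1, min(qmax, round(v / scale))) for v in group)
    return codes, scales


def gather_dequant(q_table, token_ids, group_size=GROUP_SIZE):
    """Dequantize only the rows selected by `token_ids`.

    The full table is never converted back to floating point; only the
    looked-up rows are expanded.
    """
    out = []
    for tid in token_ids:
        codes, scales = q_table[tid]
        out.append([c * scales[i // group_size] for i, c in enumerate(codes)])
    return out


# Hypothetical 3-token vocabulary with hidden size 8.
table = [
    [0.10, -0.20, 0.30, -0.40, 1.0, 2.0, -3.0, 4.0],
    [0.05, 0.00, -0.05, 0.07, 0.5, -0.5, 0.25, 0.0],
    [7.0, -7.0, 3.5, 0.0, -0.7, 0.7, 0.35, 0.0],
]
q_table = [quantize_row(r) for r in table]
rows = gather_dequant(q_table, [2, 0])  # look up tokens 2 and 0 only
```

With symmetric rounding, each reconstructed element differs from the original by at most half of its group's scale, which is why smaller groups generally track the original values more closely.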

The recipe targets the Embedding module by class name (["Embedding"]), which is portable across architectures and independent of a model's module prefix. This matters because name-based targets (e.g. re:.*embed_tokens$) require the model to forward prefix to its VocabParallelEmbedding, which not all vLLM models do (see vllm-project/vllm#45535).
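The difference between class-name and name-based targets can be sketched with a simplified matcher. This is an illustration only, not the matching logic used by llm-compressor, compressed-tensors, or vLLM: `match_targets`, the stand-in classes, and the module dictionaries are hypothetical, and a layer whose prefix was not forwarded is modeled here as having an empty name.

```python
import re


class Embedding:
    """Stand-in for an embedding layer class."""


class Linear:
    """Stand-in for a linear layer class."""


def match_targets(modules, targets):
    """Return the names of modules matched by any target.

    A target starting with "re:" is a regex applied to the module name;
    any other target is compared against the module's class name.
    """
    matched = []
    for name, module in modules.items():
        for target in targets:
            if target.startswith("re:"):
                hit = re.match(target[3:], name) is not None
            else:
                hit = type(module).__name__ == target
            if hit:
                matched.append(name)
                break
    return matched


# Model that reports a full dotted name for its embedding layer.
with_prefix = {
    "model.embed_tokens": Embedding(),
    "model.layers.0.q_proj": Linear(),
}
# Model whose embedding layer never received its prefix (modeled as "").
missing_prefix = {"": Embedding()}

print(match_targets(with_prefix, ["Embedding"]))
print(match_targets(with_prefix, ["re:.*embed_tokens$"]))
print(match_targets(missing_prefix, ["Embedding"]))
print(match_targets(missing_prefix, ["re:.*embed_tokens$"]))
```

In this simplified model the class-name target finds the embedding in both cases, while the regex target finds nothing once the name is missing.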

Embedding quantization is weight-only and data-free (no calibration set), near-lossless, and most useful for large-vocabulary models where the embedding table is a meaningful fraction of memory.
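As a rough back-of-the-envelope check on the memory argument, the sketch below estimates embedding-table size before and after quantization. The helper `embedding_bytes` is hypothetical, and the numbers rest on assumptions not stated in this PR: a vocabulary of 128,256 and hidden size of 4,096 (taken here as the Llama-3-8B shape), 2-byte scales, symmetric quantization with no zero points, and no packing or alignment overhead.

```python
def embedding_bytes(vocab_size, hidden_size, num_bits, group_size=None, scale_bytes=2):
    """Approximate storage for an embedding table.

    Counts the weight bits plus, when `group_size` is given, one scale of
    `scale_bytes` bytes per group along the hidden dimension.
    """
    weight_bytes = vocab_size * hidden_size * num_bits // 8
    if group_size is None:
        return weight_bytes
    num_scales = vocab_size * (hidden_size // group_size)
    return weight_bytes + num_scales * scale_bytes


# Assumed model shape (not stated in the PR).
VOCAB, HIDDEN = 128_256, 4_096

fp16 = embedding_bytes(VOCAB, HIDDEN, num_bits=16)
w4g64 = embedding_bytes(VOCAB, HIDDEN, num_bits=4, group_size=64)

print(f"fp16 table:        {fp16 / 1e9:.2f} GB")
print(f"int4 group-64:     {w4g64 / 1e9:.2f} GB")
print(f"reduction factor:  {fp16 / w4g64:.2f}x")
```

Under these assumptions the table shrinks from roughly 1.05 GB to roughly 0.28 GB; the per-group scales account for the gap between that and an ideal 4x reduction.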

Changes

examples/quantization_embedding/llama3_example.py — data-free embedding quant (int4, group size 64), sample generation, compressed save.
examples/quantization_embedding/README.md — walkthrough, channel / 8-bit variants, how to compose with linear-weight quantization, and an accuracy table.

Testing

Ran the example flow end-to-end (load → oneshot → dispatch_model → generate → save); exits clean. Accuracy (lm-eval) on pythia-1.4b shows embedding quantization is near-lossless:

| scheme | wikitext ppl | arc_easy acc |
| --- | --- | --- |
| baseline (fp16) | 14.733 | 0.6048 |
| embedding W8 channel | 14.732 | 0.6052 |
| embedding W4 group-64 | 14.752 | 0.6061 |
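To put the table in relative terms, the following sketch computes each scheme's change against the fp16 baseline using only the values reported above. The dictionary layout and the `deltas` helper are illustrative. The table reports no standard errors, so differences this small may be within evaluation noise.

```python
baseline = {"wikitext_ppl": 14.733, "arc_easy_acc": 0.6048}
schemes = {
    "embedding W8 channel": {"wikitext_ppl": 14.732, "arc_easy_acc": 0.6052},
    "embedding W4 group-64": {"wikitext_ppl": 14.752, "arc_easy_acc": 0.6061},
}


def deltas(result, reference=baseline):
    """Return (relative perplexity change in percent, absolute accuracy change)."""
    ppl_rel = (result["wikitext_ppl"] - reference["wikitext_ppl"]) / reference["wikitext_ppl"] * 100
    acc_abs = result["arc_easy_acc"] - reference["arc_easy_acc"]
    return ppl_rel, acc_abs


for name, result in schemes.items():
    ppl_rel, acc_abs = deltas(result)
    print(f"{name}: ppl {ppl_rel:+.3f}%, arc_easy {acc_abs:+.4f}")
```

Both schemes change wikitext perplexity by well under one percent relative to the baseline.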

The example references meta-llama/Meta-Llama-3-8B-Instruct per the repo convention (each scheme folder has a llama3 example); the identical recipe was validated on pythia-1.4b and Mistral-7B-v0.1, and the resulting checkpoints load and generate in vLLM.

This change was developed with AI assistance (Claude Code). All changed lines were reviewed by the submitter.

Adds examples/quantization_embedding showing how to quantize a model's input embedding table to weight-only intN (WNA16) with a data-free QuantizationModifier. The recipe targets the `Embedding` module by class name (portable across architectures, independent of module prefix). Embedding quantization is near-lossless and most useful for large-vocabulary models. Verified the flow end-to-end (load -> oneshot -> generate -> save); accuracy table from pythia-1.4b included in the README. The resulting checkpoint loads and runs in vLLM.

Co-authored-by: Claude
Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>

github-actions · 2026-06-13T17:01:10Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

coderabbitai · 2026-06-13T17:01:20Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 07682f40-159a-40f4-af4c-89a453f81241

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This PR adds a new embedding quantization example directory with a runnable Llama3 8B Instruct script and comprehensive documentation. The example demonstrates data-free int4 weight quantization of input embeddings using QuantizationModifier and includes sample accuracy results and composition patterns.

Changes

Embedding Quantization Example

| Layer / File(s) | Summary |
| --- | --- |
| Llama3 embedding quantization runnable example `examples/quantization_embedding/llama3_example.py` | Loads Llama3 8B Instruct model and tokenizer, defines a QuantizationModifier recipe to quantize Embedding weights to int4 (grouped, symmetric, group size 64), applies quantization via oneshot, generates sample text from `"Hello my name is"`, and saves the quantized model and tokenizer to disk with a descriptor in the output directory name. |
| Embedding quantization documentation `examples/quantization_embedding/README.md` | Documentation page covers installation, quickstart, code walkthrough (model loading, data-free quantization with strategy notes, checkpoint saving), composition with linear quantization, and a Pythia 1.4B accuracy comparison table. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

enhancement, llama, w4a16

Suggested reviewers

HDCharles
kylesayrs

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly describes the main addition of input-embedding quantization examples to the codebase, accurately reflecting the changeset. |
| Description check | ✅ Passed | The description comprehensively explains the purpose, changes, and testing of the embedding quantization examples, closely aligned with the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Linked repositories: Your configuration references 1 linked repositories, but your current plan allows 0. Analyzed ``, skipped vllm-project/compressed-tensors.


mergify · 2026-06-13T17:01:39Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviews

Waiting for

#approved-reviews-by >= 2

This rule is failing.

PRs labelled "two-reviews" must have at least two approving reviews before merging.

#approved-reviews-by >= 2
#changes-requested-reviews-by = 0

gemini-code-assist

Code Review

This pull request introduces a new example and documentation for quantizing a model's input embedding table using llm-compressor. The feedback suggests adding device_map="auto" when loading the model in both the example script and the README to prevent potential CPU Out-Of-Memory (OOM) issues. Additionally, it is recommended to replace the invalid Python syntax in the README's composition example with a valid placeholder to ensure copy-pasted code remains functional.

gemini-code-assist · 2026-06-13T17:02:25Z

+
+# Select model and load it.
+model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")


Specifying device_map='auto' is highly recommended when loading large models like Llama-3-8B. Without it, the model is loaded entirely into CPU RAM first, which can easily trigger Out-Of-Memory (OOM) crashes on standard GPU instances with limited system memory.

Suggested change

-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/quantization_embedding/README.md`:
- Around line 24-26: Update the quickstart command in the README so it runs from
the repository root: replace the current line "python3 llama3_example.py" with
either an explicit path "python3
examples/quantization_embedding/llama3_example.py" or add a preceding "cd
examples/quantization_embedding && python3 llama3_example.py" step so the
example is runnable as documented.
- Around line 38-40: Add a short note to the README next to the model usage
(referencing model_id and the AutoModelForCausalLM.from_pretrained /
AutoTokenizer.from_pretrained calls) stating that
"meta-llama/Meta-Llama-3-8B-Instruct" is gated on Hugging Face and that users
must have account access and authenticate locally (for example via
huggingface-cli login or setting HF_TOKEN) before running the example; keep it
concise and placed immediately after the example snippet.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 913a4f30-8bec-4a4b-9d9d-f6fc59b4d271

📥 Commits

Reviewing files that changed from the base of the PR and between 6d2a090 and efd26ab.

📒 Files selected for processing (2)

examples/quantization_embedding/README.md
examples/quantization_embedding/llama3_example.py

coderabbitai · 2026-06-13T17:04:45Z

+```bash
+python3 llama3_example.py
+```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix quickstart command path to make the example runnable from repo root.

python3 llama3_example.py is not correct from the repository root implied by the installation steps; use an explicit path or add a cd step.

Proposed doc fix

## Quickstart

```bash
-python3 llama3_example.py
+python3 examples/quantization_embedding/llama3_example.py
```

As per coding guidelines, `**/README.md`: "Ensure installation instructions and examples are correct."


Source: Coding guidelines

coderabbitai · 2026-06-13T17:04:45Z

+model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+tokenizer = AutoTokenizer.from_pretrained(model_id)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the gated-model access prerequisite for Llama 3.

The walkthrough uses meta-llama/Meta-Llama-3-8B-Instruct; readers need an explicit note about required Hugging Face access/token or quickstart can fail unexpectedly.

Proposed doc fix

### 1) Load the model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

+> Note: This model is gated on Hugging Face. Ensure your account has access and
+> authenticate locally (for example via huggingface-cli login) before running.

As per coding guidelines, `**/README.md`: "Review for clarity, accuracy, and up-to-date information."


Source: Coding guidelines
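One way to surface the gated-model prerequisite early is a pre-flight check for local credentials before attempting the download. The sketch below is an assumption-laden illustration, not part of the PR: `hf_token_available` is a hypothetical helper, and it assumes the Hugging Face tooling reads the `HF_TOKEN` environment variable and stores a saved login in a `token` file under `HF_HOME` (defaulting here to `~/.cache/huggingface`). Finding a token does not prove the account has been granted access to the gated repository.

```python
import os
from pathlib import Path


def hf_token_available(env=None, home=None):
    """Return True if a Hugging Face token appears to be configured locally.

    Checks the HF_TOKEN environment variable first, then looks for a
    `token` file under HF_HOME (assumed default: ~/.cache/huggingface).
    """
    env = os.environ if env is None else env
    if env.get("HF_TOKEN"):
        return True
    hf_home = env.get("HF_HOME")
    base = Path(hf_home) if hf_home else Path(home or Path.home()) / ".cache" / "huggingface"
    return (base / "token").is_file()


if not hf_token_available():
    print(
        "No Hugging Face token found. Request access to the gated model and "
        "authenticate (for example via huggingface-cli login or HF_TOKEN) first."
    )
```

Running a check like this at the top of an example script turns an opaque download failure into an actionable message.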


mergify Bot added the documentation and two-reviews labels Jun 13, 2026

coderabbitai Bot added the enhancement, llama, and w4a16 labels and removed the two-reviews label Jun 13, 2026

mergify Bot added the two-reviews label Jun 13, 2026

gemini-code-assist Bot reviewed Jun 13, 2026

coderabbitai Bot reviewed Jun 13, 2026

KKothuri and others added 2 commits June 13, 2026 11:07

Update examples/quantization_embedding/llama3_example.py

0da9136

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>

Update examples/quantization_embedding/README.md

4a8e0f7

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>

KKothuri wants to merge 3 commits into vllm-project:main from KKothuri:embedding-quant-example