
Add input-embedding quantization example #2830

Open
KKothuri wants to merge 3 commits into vllm-project:main from KKothuri:embedding-quant-example

@KKothuri

Purpose

Adds examples/quantization_embedding/ — an example showing how to quantize a model's input embedding table to weight-only intN (WNA16) with a data-free QuantizationModifier. vLLM loads these checkpoints (embedding-quant support added in vllm-project/vllm#44340) and runs a fused gather + dequant over the looked-up rows, so the packed table is never densified.
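To make the "gather + dequant over the looked-up rows" idea concrete, here is a minimal pure-Python sketch of the arithmetic, under stated assumptions. It is not vLLM's kernel and not llm-compressor's implementation: the helper names `quantize_row` and `gather_dequant` are hypothetical, the group size is shrunk to 4 for readability, the scale is assumed to map the largest magnitude in a group to 7, and the packing of two int4 codes per byte is omitted.

```python
# Illustrative only: group-wise symmetric int4 quantization of a tiny embedding
# table, plus a lookup that dequantizes just the requested rows.
GROUP_SIZE = 4  # the PR uses 64; 4 keeps the toy example readable
QMAX = 7        # assumption: the scale maps max |w| in a group to code 7


def quantize_row(row, group_size=GROUP_SIZE):
    """Return (int codes, per-group scales) for one embedding row."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7] after rounding.
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def gather_dequant(table, token_ids, group_size=GROUP_SIZE):
    """Dequantize only the rows selected by token_ids; other rows stay quantized."""
    out = []
    for tok in token_ids:
        codes, scales = table[tok]
        out.append([q * scales[i // group_size] for i, q in enumerate(codes)])
    return out


embeddings = [
    [0.10, -0.20, 0.05, 0.40, 1.00, -0.50, 0.25, 0.00],
    [0.70, 0.35, -0.70, 0.00, -0.02, 0.01, 0.04, -0.03],
    [0.00, 0.00, 0.00, 0.00, 0.30, 0.30, -0.30, 0.15],
]
quantized_table = [quantize_row(row) for row in embeddings]
rows = gather_dequant(quantized_table, token_ids=[2, 0])
```

Only the two requested rows are reconstructed, and each reconstructed value differs from the original by at most half a quantization step of its group.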

The recipe targets the Embedding module by class name (["Embedding"]), which is portable across architectures and independent of a model's module prefix. This matters because name-based targets (e.g. re:.*embed_tokens$) require the model to forward prefix to its VocabParallelEmbedding, which not all vLLM models do (see vllm-project/vllm#45535).
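The portability argument can be illustrated with a toy matcher. This is a simplified stand-in, not the matching code used by llm-compressor or vLLM: the `matches` and `select` helpers and the two module dictionaries are hypothetical, and the sketch only shows why a class-name target is independent of how a model names its embedding module (it does not reproduce the vLLM prefix-forwarding issue).

```python
import re


class Embedding:  # stand-in for torch.nn.Embedding
    pass


class Linear:  # stand-in for torch.nn.Linear
    pass


def matches(target, module_name, module):
    """Hypothetical rule: 're:' targets test the module name, others the class name."""
    if target.startswith("re:"):
        return re.match(target[3:], module_name) is not None
    return type(module).__name__ == target


def select(target, modules):
    """Return the names of all modules that the target selects."""
    return [name for name, mod in modules.items() if matches(target, name, mod)]


# Two hypothetical models that name their input embedding differently.
llama_like = {"model.embed_tokens": Embedding(), "lm_head": Linear()}
gpt_neox_like = {"gpt_neox.embed_in": Embedding(), "embed_out": Linear()}

by_class = [select("Embedding", m) for m in (llama_like, gpt_neox_like)]
by_name = [select("re:.*embed_tokens$", m) for m in (llama_like, gpt_neox_like)]
```

In this sketch the class-name target finds the embedding in both layouts, while the name pattern finds it only where the module happens to be called `embed_tokens`.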

Embedding quantization is weight-only and data-free (no calibration set), near-lossless, and most useful for large-vocabulary models where the embedding table is a meaningful fraction of memory.
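As a rough back-of-the-envelope check on the memory argument, the sketch below estimates table sizes. The shape (vocabulary 128256, hidden size 4096, as in Llama 3 8B) is an assumption added here, as are the 16-bit scales, the absence of zero points (symmetric quantization), and the reading of "channel" as one scale per embedding row; the helper `embedding_bytes` is hypothetical and ignores any packing or metadata overhead.

```python
def embedding_bytes(vocab_size, hidden_size, weight_bits, group_size=None, scale_bits=16):
    """Approximate storage for an embedding table, in bytes.

    group_size=None means no scales (an unquantized table); otherwise one
    scale of scale_bits is stored per group of group_size weights in a row.
    """
    weight_bytes = vocab_size * hidden_size * weight_bits // 8
    if group_size is None:
        return weight_bytes
    scale_bytes = vocab_size * (hidden_size // group_size) * scale_bits // 8
    return weight_bytes + scale_bytes


# Assumed shape: vocabulary 128256, hidden size 4096.
VOCAB, HIDDEN = 128_256, 4_096
fp16 = embedding_bytes(VOCAB, HIDDEN, weight_bits=16)
w4_g64 = embedding_bytes(VOCAB, HIDDEN, weight_bits=4, group_size=64)
w8_channel = embedding_bytes(VOCAB, HIDDEN, weight_bits=8, group_size=HIDDEN)

for name, size in [("fp16", fp16), ("W4 group-64", w4_g64), ("W8 channel", w8_channel)]:
    print(f"{name:12s} {size / 2**20:8.1f} MiB")
```

Under these assumptions the fp16 table is about 1.05 GB and the int4 group-64 table about 0.28 GB, so the saving grows in direct proportion to vocabulary size.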

Changes

  • examples/quantization_embedding/llama3_example.py — data-free embedding quant (int4, group size 64), sample generation, compressed save.
  • examples/quantization_embedding/README.md — walkthrough, channel / 8-bit variants, how to compose with linear-weight quantization, and an accuracy table.
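For orientation, the two weight schemes named above can be written down as plain dictionaries. This is a hypothetical configuration sketch, not the contents of `llama3_example.py` (which is not shown in this thread): the field names are assumed to follow compressed-tensors' `QuantizationArgs`, the symmetric setting for the 8-bit channel variant is an assumption, and `groups_per_row` is an illustrative helper.

```python
# Hypothetical plain-dict view of the two weight schemes named in this PR.
EMBEDDING_W4_GROUP64 = {
    "targets": ["Embedding"],  # class-name target, as described in the PR
    "weights": {
        "num_bits": 4,
        "type": "int",
        "symmetric": True,
        "strategy": "group",
        "group_size": 64,
    },
}

EMBEDDING_W8_CHANNEL = {
    "targets": ["Embedding"],
    "weights": {
        "num_bits": 8,
        "type": "int",
        "symmetric": True,  # assumption: the thread does not state this for W8
        "strategy": "channel",
    },
}


def groups_per_row(scheme, hidden_size):
    """Number of scale groups in one embedding row for a given scheme."""
    weights = scheme["weights"]
    if weights["strategy"] == "channel":
        return 1
    group_size = weights["group_size"]
    if hidden_size % group_size != 0:
        raise ValueError(f"hidden_size {hidden_size} is not a multiple of {group_size}")
    return hidden_size // group_size
```

The divisibility check is a precaution in this sketch only; whether the real tooling requires the hidden size to be a multiple of the group size is not stated in the PR.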

Testing

Ran the example flow end-to-end (load → oneshot → dispatch_model → generate → save); it exits cleanly. Accuracy (lm-eval) on pythia-1.4b shows embedding quantization is near-lossless:

| scheme | wikitext ppl | arc_easy acc |
| --- | --- | --- |
| baseline (fp16) | 14.733 | 0.6048 |
| embedding W8 channel | 14.732 | 0.6052 |
| embedding W4 group-64 | 14.752 | 0.6061 |
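To put "near-lossless" in numbers, the deltas against the fp16 baseline can be computed directly from the figures in the table. The sketch below uses only those reported values; the helper `relative_change_pct` is a name introduced here for illustration.

```python
# Values copied from the accuracy table in this PR (pythia-1.4b, lm-eval).
baseline = {"wikitext_ppl": 14.733, "arc_easy_acc": 0.6048}
results = {
    "embedding W8 channel": {"wikitext_ppl": 14.732, "arc_easy_acc": 0.6052},
    "embedding W4 group-64": {"wikitext_ppl": 14.752, "arc_easy_acc": 0.6061},
}


def relative_change_pct(value, reference):
    """Signed change of value relative to reference, in percent."""
    return 100.0 * (value - reference) / reference


for scheme, metrics in results.items():
    ppl_pct = relative_change_pct(metrics["wikitext_ppl"], baseline["wikitext_ppl"])
    acc_delta = metrics["arc_easy_acc"] - baseline["arc_easy_acc"]
    print(f"{scheme}: ppl {ppl_pct:+.3f}%, arc_easy {acc_delta:+.4f}")
```

The W4 group-64 perplexity is about 0.13% above baseline and the W8 channel perplexity is within 0.01% of it. The PR does not report standard errors, so whether the small arc_easy differences exceed run-to-run noise cannot be judged from these numbers alone.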

The example references meta-llama/Meta-Llama-3-8B-Instruct per the repo convention (each scheme folder has a llama3 example); the identical recipe was validated on pythia-1.4b and Mistral-7B-v0.1, and the resulting checkpoints load and generate in vLLM.


This change was developed with AI assistance (Claude Code). All changed lines were reviewed by the submitter.

Adds examples/quantization_embedding showing how to quantize a model's input
embedding table to weight-only intN (WNA16) with a data-free QuantizationModifier.

The recipe targets the `Embedding` module by class name (portable across
architectures, independent of module prefix). Embedding quantization is
near-lossless and most useful for large-vocabulary models. Verified the flow
end-to-end (load -> oneshot -> generate -> save); accuracy table from pythia-1.4b
included in the README. The resulting checkpoint loads and runs in vLLM.

Co-authored-by: Claude
Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Contributor

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 07682f40-159a-40f4-af4c-89a453f81241

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This PR adds a new embedding quantization example directory with a runnable Llama3 8B Instruct script and comprehensive documentation. The example demonstrates data-free int4 weight quantization of input embeddings using QuantizationModifier and includes sample accuracy results and composition patterns.

Changes

Embedding Quantization Example

| Layer / File(s) | Summary |
| --- | --- |
| Llama3 embedding quantization runnable example (examples/quantization_embedding/llama3_example.py) | Loads Llama3 8B Instruct model and tokenizer, defines a QuantizationModifier recipe to quantize Embedding weights to int4 (grouped, symmetric, group size 64), applies quantization via oneshot, generates sample text from "Hello my name is", and saves the quantized model and tokenizer to disk with a descriptor in the output directory name. |
| Embedding quantization documentation (examples/quantization_embedding/README.md) | Documentation page covers installation, quickstart, code walkthrough (model loading, data-free quantization with strategy notes, checkpoint saving), composition with linear quantization, and a Pythia 1.4B accuracy comparison table. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

enhancement, llama, w4a16

Suggested reviewers

  • HDCharles
  • kylesayrs
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly describes the main addition of input-embedding quantization examples to the codebase, accurately reflecting the changeset. |
| Description check | ✅ Passed | The description comprehensively explains the purpose, changes, and testing of the embedding quantization examples, closely aligned with the changeset. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

Review ran into problems

🔥 Problems

Linked repositories: Your configuration references 1 linked repositories, but your current plan allows 0. Analyzed ``, skipped vllm-project/compressed-tensors.



Comment @coderabbitai help to get the list of available commands and usage tips.

@mergify mergify Bot added documentation Improvements or additions to documentation two-reviews When a PR requires two reviews labels Jun 13, 2026
@mergify

mergify Bot commented Jun 13, 2026

Contributor

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviews

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

PRs labelled "two-reviews" must have at least two approving reviews before merging.

  • #approved-reviews-by >= 2
  • #changes-requested-reviews-by = 0

@coderabbitai coderabbitai Bot added enhancement New feature or request llama For any PR / issue related to Llama herd support w4a16 and removed two-reviews When a PR requires two reviews labels Jun 13, 2026
@mergify mergify Bot added the two-reviews When a PR requires two reviews label Jun 13, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new example and documentation for quantizing a model's input embedding table using llm-compressor. The feedback suggests adding device_map="auto" when loading the model in both the example script and the README to prevent potential CPU Out-Of-Memory (OOM) issues. Additionally, it is recommended to replace the invalid Python syntax in the README's composition example with a valid placeholder to ensure copy-pasted code remains functional.


# Select model and load it.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

Contributor

medium

Specifying device_map='auto' is highly recommended when loading large models like Llama-3-8B. Without it, the model is loaded entirely into CPU RAM first, which can easily trigger Out-Of-Memory (OOM) crashes on standard GPU instances with limited system memory.

Suggested change
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")
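For a sense of scale behind the memory concern, the weight footprint can be estimated from the parameter count. The figure of roughly 8.0e9 parameters for an 8B model is a round-number assumption made here, and `weights_gib` is a hypothetical helper; the estimate covers weights only and ignores activations, buffers, and any loading overhead.

```python
def weights_gib(num_params, bytes_per_param):
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: about 8.0e9 parameters stored in a 16-bit dtype (2 bytes each).
approx_params = 8.0e9
print(f"16-bit weights: ~{weights_gib(approx_params, 2):.1f} GiB")
print(f"32-bit weights: ~{weights_gib(approx_params, 4):.1f} GiB")
```

Under that assumption the 16-bit weights alone need on the order of 15 GiB wherever they are first materialized.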

Comment thread examples/quantization_embedding/README.md
Comment thread examples/quantization_embedding/README.md Outdated

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/quantization_embedding/README.md`:
- Around line 24-26: Update the quickstart command in the README so it runs from
the repository root: replace the current line "python3 llama3_example.py" with
either an explicit path "python3
examples/quantization_embedding/llama3_example.py" or add a preceding "cd
examples/quantization_embedding && python3 llama3_example.py" step so the
example is runnable as documented.
- Around line 38-40: Add a short note to the README next to the model usage
(referencing model_id and the AutoModelForCausalLM.from_pretrained /
AutoTokenizer.from_pretrained calls) stating that
"meta-llama/Meta-Llama-3-8B-Instruct" is gated on Hugging Face and that users
must have account access and authenticate locally (for example via
huggingface-cli login or setting HF_TOKEN) before running the example; keep it
concise and placed immediately after the example snippet.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 913a4f30-8bec-4a4b-9d9d-f6fc59b4d271

📥 Commits

Reviewing files that changed from the base of the PR and between 6d2a090 and efd26ab.

📒 Files selected for processing (2)
  • examples/quantization_embedding/README.md
  • examples/quantization_embedding/llama3_example.py

Comment on lines +24 to +26
```bash
python3 llama3_example.py
```

Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix quickstart command path to make the example runnable from repo root.

python3 llama3_example.py is not correct from the repository root implied by the installation steps; use an explicit path or add a cd step.

Proposed doc fix
 ## Quickstart
 
 ```bash
-python3 llama3_example.py
+python3 examples/quantization_embedding/llama3_example.py
 ```

As per coding guidelines, `**/README.md`: “Ensure installation instructions and examples are correct.”

Source: Coding guidelines

Comment on lines +38 to +40
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the gated-model access prerequisite for Llama 3.

The walkthrough uses meta-llama/Meta-Llama-3-8B-Instruct; readers need an explicit note about required Hugging Face access/token or quickstart can fail unexpectedly.

Proposed doc fix
 ### 1) Load the model
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
 model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
 model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 ```
+
+> Note: This model is gated on Hugging Face. Ensure your account has access and
+> authenticate locally (for example via huggingface-cli login) before running.

As per coding guidelines, `**/README.md`: “Review for clarity, accuracy, and up-to-date information.”

Source: Coding guidelines
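One way a reader might guard against the gated-model failure before starting a long download is a small pre-flight check on the `HF_TOKEN` environment variable mentioned in the review prompt. This is a sketch, not part of the PR: `hf_token_available` is a hypothetical helper, and it covers only the environment-variable route. A token stored by `huggingface-cli login` would not be detected by it.

```python
import os


def hf_token_available(env=None):
    """Return True if a non-empty HF_TOKEN is set in the given environment mapping."""
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_available():
    print(
        "HF_TOKEN is not set; run `huggingface-cli login` or export HF_TOKEN "
        "before loading a gated model such as meta-llama/Meta-Llama-3-8B-Instruct."
    )
```

Passing the environment as a parameter keeps the check easy to exercise without modifying the real process environment.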

KKothuri and others added 2 commits June 13, 2026 11:07
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Karthik Kothuri <karthikkothuri2009@gmail.com>

Labels

  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request
  • llama: For any PR / issue related to Llama herd support
  • two-reviews: When a PR requires two reviews
  • w4a16

1 participant