[Autoround][DDP] Add Qwen MoE Example by yiliu30 · Pull Request #2844 · vllm-project/llm-compressor

yiliu30 · 2026-06-21T12:43:17Z

SUMMARY:

~~Depends on "Add force_local_cache context manager for independent per-rank model loading" (compressed-tensors#742)~~

  
# vllm ({'pretrained': 'Yi30/Qwen3-235B-A22B-Instruct-2507-W4A16-AutoRound-iters100-nsamples256-DDP2', 'tensor_parallel_size': 4, 'max_model_len': 8192, 'max_num_batched_tokens': 32768, 'max_num_seqs': 128, 'add_bos_token': True, 'gpu_memory_utilization': 0.8, 'dtype': 'bfloat16', 'max_gen_toks': 2048, 'enable_prefix_caching': False}), gen_kwargs: ({}), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.956|±  |0.0065|
# |     |       |strict-match    |     5|exact_match|↑  |0.957|±  |0.0064|

TEST PLAN:
"please outline how the changes were tested"

When DDP is initialized before model loading, OffloadCache.cls_from_device selects the distributed cache variants. Each register_parameter call inside initialize_module_for_quantization then triggers offload(), which performs a dist.broadcast plus a barrier. For large MoE models (e.g. Qwen3-235B) with 100K+ Linear layers × 6 quant params each, this amounts to 600K+ collective ops, which effectively hangs.

Fix: wrap initialize_module_for_quantization in disable_onloading() so that new params are stored directly without triggering the distributed offload.

Verified on Qwen3-235B-A22B: apply_quantization_config went from hanging to completing in ~4.3 min.
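The cost described in this commit message can be illustrated with a toy model. This is a minimal sketch, not the llm-compressor or compressed-tensors implementation: `ToyDistributedCache`, `toy_disable_onloading`, `init_quant_params`, and the placeholder parameter names are all hypothetical, and a counter stands in for the real `dist.broadcast` plus barrier.

```python
from contextlib import contextmanager


class ToyDistributedCache:
    """Hypothetical stand-in for a distributed offload cache."""

    def __init__(self):
        self.store = {}
        self.collective_ops = 0       # counts simulated broadcast+barrier calls
        self.onloading_enabled = True

    def offload(self, name, value):
        if self.onloading_enabled:
            # In a real distributed cache this is where a collective op
            # (broadcast + barrier) would run; here we only count it.
            self.collective_ops += 1
        self.store[name] = value


@contextmanager
def toy_disable_onloading(cache):
    """Hypothetical context manager: store params directly, skip collectives."""
    previous = cache.onloading_enabled
    cache.onloading_enabled = False
    try:
        yield
    finally:
        cache.onloading_enabled = previous


# Six placeholder names; the source only says "6 quant params" per layer.
QUANT_PARAMS = tuple(f"qparam_{i}" for i in range(6))


def init_quant_params(cache, num_layers):
    """Register every quant param of every layer through the cache."""
    for layer in range(num_layers):
        for param in QUANT_PARAMS:
            cache.offload(f"layer{layer}.{param}", 0.0)


slow = ToyDistributedCache()
init_quant_params(slow, num_layers=1000)

fast = ToyDistributedCache()
with toy_disable_onloading(fast):
    init_quant_params(fast, num_layers=1000)

print(slow.collective_ops, fast.collective_ops)
# Scaling the per-layer cost to the size quoted in the commit message:
print(100_000 * len(QUANT_PARAMS))  # 600000
```

Both caches end up holding the same parameters; only the number of simulated collective ops differs, which is the effect the fix relies on.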

Signed-off-by: yiliu30 <yi4.liu@intel.com>

- Delete experimental/debug scripts (repro_*.py, test_option*.py)
- Delete redundant examples (multi_gpu_torchrun.py, multi_gpu_example.py, fast_pipeline.py, launch scripts)
- Delete CHANGES.md (absorbed into DDP_FIXES.md)
- Revert dist.py CT version compat change (unrelated to DDP)
- Add FX_TRACE_ISSUE.md documentation
- Keep: base.py, helpers.py, dev.py, ddp_autoround.py, docs

Signed-off-by: yiliu30 <yi4.liu@intel.com>

github-actions · 2026-06-21T12:43:24Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

coderabbitai · 2026-06-21T12:43:25Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


gemini-code-assist

Code Review

This pull request introduces a DDP AutoRound quantization example for large MoE models and updates the AutoRound modifier to support multi-GPU group configurations. It also refines offload suspension logic and fixes device indexing in get_main_device(). The review feedback identifies critical issues regarding the use of global rank instead of local rank for calculating GPU indices in multi-node DDP environments, which can lead to out-of-bounds errors. Additionally, a hardcoded user directory path in the example script should be replaced with a relative or configurable path to ensure portability.

gemini-code-assist · 2026-06-21T12:44:45Z

+
+    rank = dist.get_rank() if dist.is_initialized() else 0
+    world_size = dist.get_world_size() if dist.is_initialized() else 1
+    main_gpu = rank * gpus_per_group


Similar to the issue in the modifier, rank here is the global rank (dist.get_rank()). Re-calculating main_gpu using the global rank will result in incorrect local GPU indices on multi-node setups. Please use the local rank to compute main_gpu.

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
main_gpu = local_rank * gpus_per_group
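To make the reviewer's point concrete, here is a small sketch comparing the two index calculations. The helper names (`main_gpu_from_index`, `local_main_gpu`, `group_devices`) and the cluster shape (2 nodes with 4 GPUs each) are assumptions chosen for illustration; only `LOCAL_RANK` and `gpus_per_group` come from the review comment.

```python
import os


def main_gpu_from_index(rank_index: int, gpus_per_group: int) -> int:
    """First GPU index of the group owned by the process with this rank index."""
    return rank_index * gpus_per_group


def local_main_gpu(gpus_per_group: int, env=os.environ) -> int:
    """Same calculation, but driven by LOCAL_RANK as the review suggests."""
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return main_gpu_from_index(local_rank, gpus_per_group)


def group_devices(main_gpu: int, gpus_per_group: int) -> list:
    """GPU indices that make up one process's group."""
    return list(range(main_gpu, main_gpu + gpus_per_group))


# Assumed layout: 2 nodes x 4 GPUs, 2 GPUs per group, so 2 processes per node.
GPUS_PER_NODE = 4
GPUS_PER_GROUP = 2

# Second process on the second node: global rank 3, local rank 1.
from_global = main_gpu_from_index(3, GPUS_PER_GROUP)
from_local = local_main_gpu(GPUS_PER_GROUP, env={"LOCAL_RANK": "1"})

print(from_global, from_global < GPUS_PER_NODE)   # index derived from global rank
print(from_local, group_devices(from_local, GPUS_PER_GROUP))
```

With this layout the global-rank calculation yields an index beyond the node's GPUs, while the local-rank calculation stays within range.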

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Use force_local_cache() from compressed-tensors instead of monkey-patches

Suppresses DistributedCPUCache per-param broadcast during mass quantization init (each scale/zero_point register_parameter triggers a collective op). Uses try/except ImportError for backwards compat with older compressed-tensors versions.
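The try/except ImportError fallback mentioned above might look roughly like the following. This is a sketch under stated assumptions: the import path `compressed_tensors.offload` is a guess (the commit message only says the helper comes from compressed-tensors), and the no-op fallback is illustrative rather than the PR's actual code.

```python
from contextlib import nullcontext

try:
    # Assumed import path; adjust to wherever the installed
    # compressed-tensors version actually exposes the helper.
    from compressed_tensors.offload import force_local_cache
except ImportError:
    # Older versions (or no compressed-tensors at all): fall back to a
    # no-op context manager so calling code does not need to branch.
    def force_local_cache():
        return nullcontext()


def init_with_local_cache(init_fn):
    """Run a quantization-init callable under the (possibly no-op) guard."""
    with force_local_cache():
        return init_fn()


result = init_with_local_cache(lambda: "initialized")
print(result)
```

Keeping the fallback at import time means the call site is identical regardless of which compressed-tensors version is installed.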

…l_cache
- Remove debug memory dump code from base.py
- Remove force_local_cache from on_initialize (matches GPTQ pattern)
- Standard load_offloaded_model + auto_offload in example
- Verified on 30B (49 layers) and 235B (first 2 layers)

Wrap QuantizationMixin.initialize_quantization with disable_onloading() to suppress DistributedCPUCache's per-param broadcast_object_list + barrier when creating quant params (scale, zero_point).

Root cause: with GPUS_PER_GROUP=2, device_map='auto_offload' assigns modules to different GPUs. initialize_qparams creates tensors on varying devices, so GPU-to-CPU copy timing differs between ranks. The paired broadcast_object_list calls then fall out of sync and deadlock at the barrier.

disable_onloading() bypasses the distributed path entirely. Quant params are deterministic across ranks (computed from the same scheme), so no synchronization is needed.

Also fix the example: save the model before destroy_process_group, because save_pretrained internally uses broadcast_object_list.
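The ordering constraint on saving can be shown with a toy process group. Everything here is hypothetical (`ToyProcessGroup`, `toy_save_pretrained`); the sketch only models the one fact stated in the commit message, namely that the save path issues a collective and therefore needs a live process group.

```python
class ToyProcessGroup:
    """Hypothetical stand-in for an initialized torch.distributed group."""

    def __init__(self):
        self.alive = True

    def broadcast_object_list(self, objects):
        if not self.alive:
            raise RuntimeError("process group already destroyed")
        return objects

    def destroy(self):
        self.alive = False


def toy_save_pretrained(group, state):
    """Stand-in for a save path that internally uses a collective."""
    return group.broadcast_object_list([state])[0]


# Correct order: save first, tear down afterwards.
group = ToyProcessGroup()
saved = toy_save_pretrained(group, {"weights": "quantized"})
group.destroy()

# Wrong order: tearing down first makes the save fail.
group = ToyProcessGroup()
group.destroy()
try:
    toy_save_pretrained(group, {"weights": "quantized"})
    failed = False
except RuntimeError:
    failed = True

print(saved, failed)
```

Work that needs no collectives, such as sample generation on a single rank, can safely run after teardown.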

- Extract _move_inputs_to static method for cleaner input device alignment
- Move input movement out of if-needs_multi_gpu branch (always correct)
- Both ranks participate in save_pretrained (uses broadcast_object_list)
- Sample generation moved after destroy_process_group
- Restore 235B model path
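A helper for input device alignment, as in the first item above, could be sketched as follows. The PR's actual `_move_inputs_to` is not shown in this page, so this is an assumption about its general shape: a recursive walk that calls `.to(device)` on anything tensor-like. `FakeTensor` is a hypothetical stand-in so the example runs without PyTorch.

```python
class FakeTensor:
    """Hypothetical tensor stand-in exposing only .device and .to()."""

    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


def move_inputs_to(obj, device):
    """Recursively move tensor-like leaves of a nested structure to `device`."""
    if callable(getattr(obj, "to", None)):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: move_inputs_to(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(move_inputs_to(item, device) for item in obj)
    return obj  # ints, bools, None, etc. pass through unchanged


batch = {
    "input_ids": FakeTensor(),
    "past": (FakeTensor(), [FakeTensor()]),
    "use_cache": True,
}
moved = move_inputs_to(batch, "cuda:2")
print(moved["input_ids"].device, moved["past"][1][0].device, moved["use_cache"])
```

Because the helper does nothing harmful when inputs are already on the target device, it can be called unconditionally rather than only inside a multi-GPU branch.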

Signed-off-by: yiliu30 <yi4.liu@intel.com>

yiliu30 added 11 commits June 8, 2026 10:55

Fix AutoRound multi-GPU DDP group handling

abbea15

update

a28e7fc

Signed-off-by: yiliu30 <yi4.liu@intel.com>

update

460b529

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Revert: restore disable_onloading() in trace_subgraphs

3e40140

clean

e1e6c99

Signed-off-by: yiliu30 <yi4.liu@intel.com>

clean

86fc407

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix

0a7abbd

Signed-off-by: yiliu30 <yi4.liu@intel.com>

update

1422ebc

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix

3f03bc6

Signed-off-by: yiliu30 <yi4.liu@intel.com>

gemini-code-assist Bot reviewed Jun 21, 2026

yiliu30 added 12 commits June 21, 2026 12:55

update

2db2a84

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Simplify ddp_qwen3_moe_example: remove argparse, hardcode model config

7761153

Use force_local_cache() from compressed-tensors instead of monkey-patches

Remove __main__ guard, fix quant_elapsed scope

9e63922

Remove TORCHELASTIC_RUN_ID guard — always run via torchrun

0cae406

fix

56247c8

Signed-off-by: yiliu30 <yi4.liu@intel.com>

update

44ee3b9

Signed-off-by: yiliu30 <yi4.liu@intel.com>

fix

47c58bb

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Merge branch 'main' into fix/autoround-ddp-gpu-groups

cebb438

[Autoround][DDP] Add Qwen MoE Example #2844
yiliu30 wants to merge 23 commits into vllm-project:main from yiliu30:fix/autoround-ddp-gpu-groups
