[Autoround][DDP] Add Qwen MoE Example #2844

Draft
yiliu30 wants to merge 23 commits into vllm-project:main from yiliu30:fix/autoround-ddp-gpu-groups

@yiliu30 yiliu30 commented Jun 21, 2026

Contributor

SUMMARY:

  
# vllm ({'pretrained': 'Yi30/Qwen3-235B-A22B-Instruct-2507-W4A16-AutoRound-iters100-nsamples256-DDP2', 'tensor_parallel_size': 4, 'max_model_len': 8192, 'max_num_batched_tokens': 32768, 'max_num_seqs': 128, 'add_bos_token': True, 'gpu_memory_utilization': 0.8, 'dtype': 'bfloat16', 'max_gen_toks': 2048, 'enable_prefix_caching': False}), gen_kwargs: ({}), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.956|±  |0.0065|
# |     |       |strict-match    |     5|exact_match|↑  |0.957|±  |0.0064|

TEST PLAN:
"please outline how the changes were tested"

yiliu30 added 11 commits June 8, 2026 10:55
When DDP is initialized before model loading,
OffloadCache.cls_from_device selects distributed cache variants.
Each register_parameter inside initialize_module_for_quantization
triggers offload() which does dist.broadcast + barrier. For large
MoE models (e.g. Qwen3-235B) with 100K+ Linear layers x 6 quant
params, this means 600K+ collective ops, so initialization effectively hangs.

Fix: wrap initialize_module_for_quantization in disable_onloading()
so new params are stored directly without triggering distributed offload.
Verified on Qwen3-235B-A22B: apply_quantization_config dropped from
hanging to ~4.3 min.
Signed-off-by: yiliu30 <yi4.liu@intel.com>
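
The effect described in this commit message can be illustrated with a toy model. The sketch below is not the compressed-tensors API: `ToyOffloadCache`, `suppress_sync`, and `init_quant_params` are hypothetical stand-ins, and the "collective ops" are only counted, never executed. It shows why wrapping a bulk parameter registration in a context manager that bypasses the distributed path removes the per-parameter broadcast and barrier.

```python
from contextlib import contextmanager


class ToyOffloadCache:
    """Hypothetical stand-in for a distributed offload cache (not the real API)."""

    def __init__(self):
        self.sync_enabled = True
        self.collective_ops = 0
        self.params = {}

    def register(self, name, value):
        if self.sync_enabled:
            # The real cache would issue a broadcast plus a barrier here;
            # this toy only counts them.
            self.collective_ops += 2
        self.params[name] = value


@contextmanager
def suppress_sync(cache):
    """Store new entries directly while the block runs, then restore the flag."""
    previous = cache.sync_enabled
    cache.sync_enabled = False
    try:
        yield cache
    finally:
        cache.sync_enabled = previous


def init_quant_params(cache, num_layers, params_per_layer=6):
    """Register params_per_layer quantization parameters for every layer."""
    for layer in range(num_layers):
        for index in range(params_per_layer):
            cache.register(f"layer{layer}.qparam{index}", 0.0)


synced = ToyOffloadCache()
init_quant_params(synced, num_layers=1000)

local = ToyOffloadCache()
with suppress_sync(local):
    init_quant_params(local, num_layers=1000)

print(synced.collective_ops, local.collective_ops)
```

With 1000 toy layers the synchronized path counts 12,000 simulated collectives and the suppressed path counts none; the commit message reports the same pattern at the 600K+ scale.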
- Delete experimental/debug scripts (repro_*.py, test_option*.py)
- Delete redundant examples (multi_gpu_torchrun.py, multi_gpu_example.py,
  fast_pipeline.py, launch scripts)
- Delete CHANGES.md (absorbed into DDP_FIXES.md)
- Revert dist.py CT version compat change (unrelated to DDP)
- Add FX_TRACE_ISSUE.md documentation
- Keep: base.py, helpers.py, dev.py, ddp_autoround.py, docs
Signed-off-by: yiliu30 <yi4.liu@intel.com>
@github-actions


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

coderabbitai Bot commented Jun 21, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a DDP AutoRound quantization example for large MoE models and updates the AutoRound modifier to support multi-GPU group configurations. It also refines offload suspension logic and fixes device indexing in get_main_device(). The review feedback identifies critical issues regarding the use of global rank instead of local rank for calculating GPU indices in multi-node DDP environments, which can lead to out-of-bounds errors. Additionally, a hardcoded user directory path in the example script should be replaced with a relative or configurable path to ensure portability.

Comment thread src/llmcompressor/modifiers/autoround/base.py Outdated

rank = dist.get_rank() if dist.is_initialized() else 0
world_size = dist.get_world_size() if dist.is_initialized() else 1
main_gpu = rank * gpus_per_group


medium

Similar to the issue in the modifier, rank here is the global rank (dist.get_rank()). Re-calculating main_gpu using the global rank will result in incorrect local GPU indices on multi-node setups. Please use the local rank to compute main_gpu.

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    main_gpu = local_rank * gpus_per_group
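
To make the review comment concrete, here is a minimal sketch of why the global rank produces out-of-range device indices on a second node. The cluster layout (2 nodes, 8 GPUs per node, 2 GPUs per group) and the helper names `main_gpu_index` and `local_rank_from_env` are illustrative assumptions, as is the contiguous rank-to-node mapping used to simulate the local rank.

```python
import os


def main_gpu_index(rank_on_node: int, gpus_per_group: int) -> int:
    """First GPU of the group owned by a rank, given its rank on the node."""
    return rank_on_node * gpus_per_group


def local_rank_from_env(default: int = 0) -> int:
    """Read LOCAL_RANK (set by launchers such as torchrun), else use default."""
    return int(os.environ.get("LOCAL_RANK", str(default)))


# Hypothetical layout: 2 nodes x 8 GPUs, 2 GPUs per group -> 4 ranks per node.
GPUS_PER_NODE = 8
GPUS_PER_GROUP = 2
RANKS_PER_NODE = GPUS_PER_NODE // GPUS_PER_GROUP

results = []
for global_rank in range(2 * RANKS_PER_NODE):
    # Assumes ranks are assigned contiguously per node.
    local_rank = global_rank % RANKS_PER_NODE
    from_global = main_gpu_index(global_rank, GPUS_PER_GROUP)
    from_local = main_gpu_index(local_rank, GPUS_PER_GROUP)
    results.append((global_rank, from_global, from_local))
    status = "ok" if from_global < GPUS_PER_NODE else "OUT OF RANGE"
    print(f"rank {global_rank}: global-based={from_global} ({status}), "
          f"local-based={from_local}")
```

Under these assumptions, every rank on the second node computes a global-based index of 8 or more, which does not exist on an 8-GPU machine, while the local-based index stays within 0 to 6.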

Comment thread examples/autoround/ddp/ddp_qwen3_moe_example.py Outdated
yiliu30 added 12 commits June 21, 2026 12:55
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Use force_local_cache() from compressed-tensors instead of monkey-patches
Suppresses DistributedCPUCache per-param broadcast during mass
quantization init (each scale/zero_point register_parameter triggers
a collective op). Uses try/except ImportError for backwards compat
with older compressed-tensors versions.
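
The backwards-compatibility pattern mentioned here can be sketched generically. The helper `optional_context_manager` below is hypothetical, and the module name passed to it is a deliberately nonexistent placeholder; the real import path of `force_local_cache` is not given in this PR text and is not assumed. The idea is to fall back to a no-op context manager when the optional feature cannot be imported.

```python
import contextlib
import importlib


def optional_context_manager(module_name: str, attr_name: str):
    """Return module.attr if it can be imported, else a no-op context manager."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr_name)
    except (ImportError, AttributeError):
        # Older versions without the feature: behave as if nothing is wrapped.
        return contextlib.nullcontext


# Placeholder module name, chosen so that the import fails in this sketch.
guard = optional_context_manager("hypothetical_missing_pkg", "force_local_cache")

with guard():
    # Quantization parameter initialization would run here.
    initialized = True

print(guard is contextlib.nullcontext, initialized)
```

Catching `AttributeError` as well as `ImportError` covers the case where the package is installed but predates the attribute.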
…l_cache

- Remove debug memory dump code from base.py
- Remove force_local_cache from on_initialize (matches GPTQ pattern)
- Standard load_offloaded_model + auto_offload in example
- Verified on 30B (49 layers) and 235B (first 2 layers)
Wrap QuantizationMixin.initialize_quantization with disable_onloading()
to suppress DistributedCPUCache's per-param broadcast_object_list+barrier
when creating quant params (scale, zero_point).

Root cause: with GPUS_PER_GROUP=2, device_map='auto_offload' assigns
modules to different GPUs. initialize_qparams creates tensors on varying
devices, so GPU->CPU copy timing differs between ranks. The paired
broadcast_object_list calls then desynchronize and deadlock at the barrier.

disable_onloading() bypasses the distributed path entirely. Quant params
are deterministic across ranks (computed from the same scheme), so no
synchronization is needed.

Also fix example: save model before destroy_process_group (save_pretrained
internally uses broadcast_object_list).
- Extract _move_inputs_to static method for cleaner input device alignment
- Move input movement out of if-needs_multi_gpu branch (always correct)
- Both ranks participate in save_pretrained (uses broadcast_object_list)
- Sample generation moved after destroy_process_group
- Restore 235B model path
Signed-off-by: yiliu30 <yi4.liu@intel.com>