[Autoround][DDP] Add Qwen MoE Example #2844
When DDP is initialized before model loading, OffloadCache.cls_from_device selects the distributed cache variants. Each register_parameter call inside initialize_module_for_quantization then triggers offload(), which performs a dist.broadcast plus a barrier. For large MoE models (e.g. Qwen3-235B) with 100K+ Linear layers × 6 quant params each, this adds up to 600K+ collective ops, so initialization effectively hangs.

Fix: wrap initialize_module_for_quantization in disable_onloading() so that new params are stored directly without triggering the distributed offload.

Verified on Qwen3-235B-A22B: apply_quantization_config went from hanging to ~4.3 min.
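The cost described in this commit message can be illustrated with a toy model that needs no distributed backend. This is only a sketch: `ToyDistributedCache`, its `disable_onloading` method, and `init_quant_params` are hypothetical stand-ins written for this example, not the compressed-tensors implementation, and each simulated broadcast-plus-barrier is counted as a single collective step.

```python
from contextlib import contextmanager


class ToyDistributedCache:
    """Hypothetical stand-in that counts the collective steps a distributed cache would issue."""

    def __init__(self):
        self.collective_ops = 0
        self.onloading_enabled = True
        self.params = {}

    def register_parameter(self, name, value):
        self.params[name] = value
        if self.onloading_enabled:
            # Stand-in for the broadcast + barrier performed on every offload.
            self.collective_ops += 1

    @contextmanager
    def disable_onloading(self):
        # Temporarily store params directly, skipping the simulated collective.
        previous = self.onloading_enabled
        self.onloading_enabled = False
        try:
            yield
        finally:
            self.onloading_enabled = previous


def init_quant_params(cache, num_layers, params_per_layer=6):
    """Register params_per_layer quantization params for each layer."""
    for layer in range(num_layers):
        for index in range(params_per_layer):
            cache.register_parameter(f"layer{layer}.qparam{index}", 0.0)


slow = ToyDistributedCache()
init_quant_params(slow, num_layers=1000)

fast = ToyDistributedCache()
with fast.disable_onloading():
    init_quant_params(fast, num_layers=1000)

print(slow.collective_ops, fast.collective_ops)  # 6000 0
```

Scaling the same count to 100K layers gives the 600K collective steps quoted above, while the wrapped path issues none.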
- Delete experimental/debug scripts (repro_*.py, test_option*.py)
- Delete redundant examples (multi_gpu_torchrun.py, multi_gpu_example.py, fast_pipeline.py, launch scripts)
- Delete CHANGES.md (absorbed into DDP_FIXES.md)
- Revert dist.py CT version compat change (unrelated to DDP)
- Add FX_TRACE_ISSUE.md documentation
- Keep: base.py, helpers.py, dev.py, ddp_autoround.py, docs
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces a DDP AutoRound quantization example for large MoE models and updates the AutoRound modifier to support multi-GPU group configurations. It also refines offload suspension logic and fixes device indexing in get_main_device(). The review feedback identifies critical issues regarding the use of global rank instead of local rank for calculating GPU indices in multi-node DDP environments, which can lead to out-of-bounds errors. Additionally, a hardcoded user directory path in the example script should be replaced with a relative or configurable path to ensure portability.
rank = dist.get_rank() if dist.is_initialized() else 0
world_size = dist.get_world_size() if dist.is_initialized() else 1
main_gpu = rank * gpus_per_group
Similar to the issue in the modifier, rank here is the global rank (dist.get_rank()). Re-calculating main_gpu using the global rank will result in incorrect local GPU indices on multi-node setups. Please use the local rank to compute main_gpu.
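A worked example of the out-of-bounds case, as a sketch under assumed numbers: two nodes with 8 GPUs each, 2 GPUs per group, and therefore 4 processes per node. The helper `main_gpu_index` and the constants are hypothetical, and the local rank is derived with a modulo on the assumption that every node runs the same number of processes.

```python
def main_gpu_index(rank, gpus_per_group):
    """First GPU index of the group owned by the given rank."""
    return rank * gpus_per_group


GPUS_PER_NODE = 8   # assumed hardware layout
GPUS_PER_GROUP = 2  # assumed group size
PROCS_PER_NODE = GPUS_PER_NODE // GPUS_PER_GROUP  # 4 processes per node

# Second process on the second node.
global_rank = 5
local_rank = global_rank % PROCS_PER_NODE  # 1 under the equal-processes-per-node assumption

from_global = main_gpu_index(global_rank, GPUS_PER_GROUP)
from_local = main_gpu_index(local_rank, GPUS_PER_GROUP)

print(from_global, from_global < GPUS_PER_NODE)  # 10 False
print(from_local, from_local < GPUS_PER_NODE)    # 2 True
```

On a single node the two ranks coincide, which is why the problem only shows up in multi-node runs.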
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
main_gpu = local_rank * gpus_per_group

Use force_local_cache() from compressed-tensors instead of monkey-patches
Suppresses DistributedCPUCache per-param broadcast during mass quantization init (each scale/zero_point register_parameter triggers a collective op). Uses try/except ImportError for backwards compat with older compressed-tensors versions.
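The try/except ImportError fallback mentioned here could look roughly like the sketch below. The import path `compressed_tensors.offload` is an assumption (the commit message only names `force_local_cache`), and `get_force_local_cache` is a hypothetical helper; on versions without the context manager it falls back to a no-op.

```python
import contextlib


def get_force_local_cache():
    """Return force_local_cache if available, else a no-op context manager factory."""
    try:
        # Assumed import location; adjust to wherever compressed-tensors exposes it.
        from compressed_tensors.offload import force_local_cache
    except ImportError:
        # Older compressed-tensors (or none installed): do nothing on enter/exit.
        return contextlib.nullcontext
    return force_local_cache


cache_context = get_force_local_cache()
print(callable(cache_context))  # True
```

Call sites can then use the returned factory in a `with` statement without checking the installed version themselves.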
…l_cache

- Remove debug memory dump code from base.py
- Remove force_local_cache from on_initialize (matches GPTQ pattern)
- Standard load_offloaded_model + auto_offload in example
- Verified on 30B (49 layers) and 235B (first 2 layers)
Wrap QuantizationMixin.initialize_quantization with disable_onloading() to suppress DistributedCPUCache's per-param broadcast_object_list+barrier when creating quant params (scale, zero_point).

Root cause: with GPUS_PER_GROUP=2, device_map='auto_offload' assigns modules to different GPUs. initialize_qparams creates tensors on varying devices, causing GPU->CPU copy timing to differ between ranks. The paired broadcast_object_list calls desync -> deadlock at barrier.

disable_onloading() bypasses the distributed path entirely. Quant params are deterministic across ranks (computed from the same scheme), so no synchronization is needed.

Also fix example: save model before destroy_process_group (save_pretrained internally uses broadcast_object_list).
- Extract _move_inputs_to static method for cleaner input device alignment
- Move input movement out of if-needs_multi_gpu branch (always correct)
- Both ranks participate in save_pretrained (uses broadcast_object_list)
- Sample generation moved after destroy_process_group
- Restore 235B model path
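The save-then-teardown ordering from the last two commit messages can be sketched as follows. `save_then_teardown` and `Recorder` are hypothetical names for this example; `dist` is expected to be `torch.distributed` in real use, and the explicit barrier before teardown is an extra precaution assumed here, not something the commits mention.

```python
def save_then_teardown(model, tokenizer, save_dir, dist):
    """Save on every rank first, then destroy the process group."""
    # Every rank calls save_pretrained, since (per the commit message) it uses
    # broadcast_object_list internally and needs the process group alive.
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    if dist.is_initialized():
        dist.barrier()  # assumed extra safety step: let all ranks finish saving
        dist.destroy_process_group()
    # Single-process work such as sample generation belongs after this point.


class Recorder:
    """Test double that records the name of every method called on it."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def call(*args, **kwargs):
            self.calls.append(name)
            return True
        return call


recorder = Recorder()
save_then_teardown(recorder, recorder, "output_dir", recorder)
print(recorder.calls)
# ['save_pretrained', 'save_pretrained', 'is_initialized', 'barrier', 'destroy_process_group']
```

Passing `dist` in as an argument keeps the ordering checkable without launching a real multi-process job.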
SUMMARY:
force_local_cache context manager for independent per-rank model loading compressed-tensors#742
TEST PLAN:
"please outline how the changes were tested"