Remove unnecessary `load_weights` methods by hmellor · Pull Request #44589 · vllm-project/vllm

hmellor · 2026-06-04T23:47:50Z

This PR adds missing functionality to AutoWeightsLoader which allows us to delete the load_weights method boilerplate from 41 architectures in vLLM. Every one of these architectures can automatically load:

- GPTQ checkpoints with correct bias skipping
- FP8 checkpoints with various scale formats
- Checkpoints with fused or sharded qkv_proj/gate_up_proj weights
- LoRA weights

The specific changes are:

Enable MergedColumnParallelLinear and QKVParallelLinear to load themselves from fused or unfused checkpoints without any special logic, provided that the checkpoint weights are mapped correctly
- The mappings for qkv_proj look like this; each entry maps a checkpoint name to a shard in QKVParallelLinear:
```
hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_substr={
        ".q_proj": ".qkv_proj.q",
        ".k_proj": ".qkv_proj.k",
        ".v_proj": ".qkv_proj.v",
    }
)
```
- The mappings for gate_up_proj look like this; each entry maps a checkpoint name to a shard in MergedColumnParallelLinear:
```
hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_substr={
        ".gate_proj": ".gate_up_proj.0",
        ".up_proj": ".gate_up_proj.1",
    }
)
```
Update ColumnParallelLinearWithLoRA, MergedColumnParallelLinearWithLoRA, QKVParallelLinearWithLoRA and MergedQKVParallelLinearWithLoRA to work when packed_modules_mapping no longer exists as a class variable of the model
Update QuantizationConfig to include the name mappings and the skipping of unexpected weights from maybe_remap_kv_scale_name (this function must stay until all models can use AutoWeightsLoader)
Add unexpected GPTQ bias skipping to AutoWeightsLoader

This change also exposed a latent bug in the layerwise online-quantization accounting:

Fp8OnlineLinearMethod.create_weights calls initialize_online_processing(layer), which snapshots load_numel_total = get_layer_size(layer) and wraps the weight loaders of the tensors that exist at that moment. But ColumnParallelLinear.__init__ registers self.bias after create_weights returns, so the bias was excluded from the expected total and its loader was never wrapped.
For OPT (which has qkv biases), the counter therefore hits the weight-only total at the last weight shard, and _layerwise_process finalizes the layer before the q bias arrives: it materializes the meta weight, replays the buffered shards, quantizes, and runs the Marlin prep, which permutes the bias for the kernel epilogue and replaces the param with a bare Parameter.
The trailing q-bias load then hits that post-processed param (no output_dim, shape [2304] vs [768]) and trips the assert in QKVParallelLinear.weight_loader.
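
The ordering problem described above can be reproduced in a toy model. The sketch below is not vLLM code: `ToyLayer` and its attributes are hypothetical, parameters are represented only by their element counts, and the sizes (hidden size 768, fused qkv output 2304) are taken from the shapes mentioned above.

```python
class ToyLayer:
    """Toy layer whose bias is registered after the size snapshot is taken."""

    def __init__(self, weight_numel: int, bias_numel: int) -> None:
        self.params: dict[str, int] = {}
        self.loaded_numel = 0
        self.finalized = False
        self.late_loads: list[str] = []

        # Stand-in for create_weights: register the weight, then snapshot the
        # expected total and the set of loaders that get wrapped.
        self.params["weight"] = weight_numel
        self.load_numel_total = sum(self.params.values())
        self.wrapped = set(self.params)

        # Stand-in for the layer's __init__ continuing after create_weights:
        # the bias is registered too late to be counted or wrapped.
        self.params["bias"] = bias_numel

    def load(self, name: str, numel: int) -> None:
        if self.finalized:
            # In the real code this is where the load meets a post-processed
            # param; here we only record that it arrived too late.
            self.late_loads.append(name)
            return
        if name in self.wrapped:
            self.loaded_numel += numel
            if self.loaded_numel >= self.load_numel_total:
                self.finalized = True


layer = ToyLayer(weight_numel=2304 * 768, bias_numel=2304)
for _shard in ("k", "v", "q"):
    layer.load("weight", 768 * 768)  # three weight shards fill the counter
layer.load("bias", 768)              # trailing q-bias shard

print(layer.finalized, layer.late_loads)
```

Because the snapshot excludes the bias, the counter reaches its total on the last weight shard and the bias load arrives after finalization.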

On main, OPT's old dict-based loader held a stale params_dict snapshot, so the late bias write went into the dead pre-Marlin tensor. This was silent corruption that the test never caught (it only checks dtypes, explicitly not accuracy). The delegation path on this branch fetches the live param at load time, which turned the silent bug into a crash.
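
The difference between a stale snapshot and a live lookup can be shown with plain Python objects. This is a minimal sketch under stated assumptions: `ToyModule` is hypothetical, a list stands in for a parameter tensor, and the reassignment stands in for post-processing replacing the param with a new object.

```python
class ToyModule:
    """Toy module holding one parameter as a plain list."""

    def __init__(self) -> None:
        self.bias = [0.0, 0.0, 0.0]


module = ToyModule()

# Dict-based loader: snapshot name -> parameter object once, up front.
params_dict = {"bias": module.bias}

# Post-processing swaps in a brand-new object for the parameter.
module.bias = [0.0, 0.0, 0.0]

# A late write through the snapshot lands in the dead, pre-processing object.
params_dict["bias"][0] = 1.0
stale_write_visible = module.bias[0] == 1.0

# A live lookup at load time sees the post-processed object instead.
live_param = getattr(module, "bias")
live_is_snapshot = live_param is params_dict["bias"]

print(stale_write_visible, live_is_snapshot)
```

In the toy version the stale write simply disappears; with a live lookup, the loader meets the replaced object directly, which is where a shape or attribute check can fail loudly.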

mergify · 2026-06-05T12:25:40Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hmellor.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork