[v1] Initialize InputBatch in initialize_kv_cache instead of __init__ by wenyili · Pull Request #45528 · vllm-project/vllm

wenyili · 2026-06-13T13:57:16Z

Summary

Moves InputBatch creation from GPUModelRunner.__init__ to initialize_kv_cache (via the new initialize_input_batch method), so it is built with the final block sizes from kv_cache_config rather than a placeholder value.
Removes the now-unnecessary _init_block_sizes / _init_kernel_block_sizes tracking fields and the conditional re-init logic in the old may_reinitialize_input_batch.
Fixes a latent bug where cp_kv_cache_interleave_size was omitted from the re-init path.
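The ordering change can be pictured with a minimal sketch. Every class below (`SketchKVCacheConfig`, `SketchInputBatch`, `SketchModelRunner`) is a hypothetical stand-in rather than the real vLLM type, and the real signatures carry many more arguments. It only illustrates that the batch is built once, from the final block sizes, and that `cp_kv_cache_interleave_size` is passed along on that single path.

```python
from dataclasses import dataclass


@dataclass
class SketchKVCacheConfig:
    # Hypothetical stand-in for kv_cache_config: only the final block sizes.
    block_sizes: list[int]


@dataclass
class SketchInputBatch:
    # Hypothetical stand-in for InputBatch.
    block_sizes: list[int]
    cp_kv_cache_interleave_size: int


class SketchModelRunner:
    def __init__(self, cp_kv_cache_interleave_size: int = 1):
        self.cp_kv_cache_interleave_size = cp_kv_cache_interleave_size
        # No placeholder batch is built here any more.
        self.input_batch = None

    def initialize_input_batch(self, kv_cache_config: SketchKVCacheConfig) -> None:
        # Always created fresh; there is no block-size comparison to skip it.
        self.input_batch = SketchInputBatch(
            block_sizes=list(kv_cache_config.block_sizes),
            cp_kv_cache_interleave_size=self.cp_kv_cache_interleave_size,
        )

    def initialize_kv_cache(self, kv_cache_config: SketchKVCacheConfig) -> None:
        self.initialize_input_batch(kv_cache_config)


runner = SketchModelRunner(cp_kv_cache_interleave_size=4)
runner.initialize_kv_cache(SketchKVCacheConfig(block_sizes=[16, 32]))
```

Because there is only one construction path in this sketch, an argument cannot be present in one path and missing from the other, which is the shape of the latent bug mentioned above.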

Background

The early initialization in __init__ was a workaround for the UVA pinned-memory reuse bug reported in #18298. The three-step failure chain was:

1. CPU offloading (#15354) stored cpu_data only as a Python attribute on the parameter object, while the CUDA view was backed by a no-op C++ deleter that did not hold a reference to the CPU tensor:
   ```
   p._vllm_offloaded_cpu_data = cpu_data   # sole Python reference to pinned memory
   p.data = get_cuda_view_from_cpu_tensor(cpu_data)
   ```
2. process_weights_after_loading (GPTQ, Marlin, FP8, …) replaced parameter objects; the old PackedvLLMParameter was GC'd, releasing cpu_data back to CachingHostAllocator.
3. InputBatch, if created after load_model, allocated pinned buffers (e.g. block_table_cpu) that reused that freed memory, aliasing live GPTQ weight CUDA views.
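The chain above can be reproduced in a toy model without a GPU. This is only a sketch under stated assumptions: `FakeCachingHostAllocator`, `FakePinnedTensor`, and `FakeParameter` are hypothetical stand-ins, integer slots play the role of pinned-memory addresses, and it relies on CPython releasing objects as soon as their last reference is dropped.

```python
import gc


class FakeCachingHostAllocator:
    """Toy allocator that hands freed slots back out (hypothetical stand-in)."""

    def __init__(self):
        self._free_slots = []
        self._next_slot = 0

    def allocate(self):
        if self._free_slots:
            return self._free_slots.pop()  # reuse a cached slot first
        slot = self._next_slot
        self._next_slot += 1
        return slot

    def release(self, slot):
        self._free_slots.append(slot)


class FakePinnedTensor:
    """Toy pinned tensor: returns its slot to the allocator when destroyed."""

    def __init__(self, allocator):
        self._allocator = allocator
        self.slot = allocator.allocate()

    def __del__(self):
        self._allocator.release(self.slot)


class FakeParameter:
    """Toy parameter object; attributes are attached dynamically."""


allocator = FakeCachingHostAllocator()

# Step 1: the parameter holds the only reference to the pinned tensor.
old_param = FakeParameter()
old_param._vllm_offloaded_cpu_data = FakePinnedTensor(allocator)
# The "CUDA view" remembers the address but does not own the tensor.
cuda_view_slot = old_param._vllm_offloaded_cpu_data.slot

# Step 2: post-load processing replaces the parameter object.
new_param = FakeParameter()
del old_param
gc.collect()

# Step 3: a later pinned allocation receives the slot the view still uses.
block_table_cpu = FakePinnedTensor(allocator)
aliased = block_table_cpu.slot == cuda_view_slot
```

In this model `aliased` ends up true: the later allocation lands on the same slot the view still points at, which mirrors block_table_cpu overlapping the weight memory.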

The workaround was to create InputBatch before load_model with a placeholder block size (#17945 / #18298), then conditionally re-create it in initialize_kv_cache if block sizes differed (#18593 / #18654). The root cause was described at the time as "unknown reasons".

I confirmed the root cause by temporarily keeping cpu_data alive on the module instead of the parameter:

```
# vllm/model_executor/models/utils.py  (temporary patch to confirm root cause)
if not hasattr(module, '_vllm_offloaded_cpu_datas'):
    module._vllm_offloaded_cpu_datas = []
module._vllm_offloaded_cpu_datas.append(cpu_data)  # survives parameter replacement
p.data = get_cuda_view_from_cpu_tensor(cpu_data)
```

This eliminated the corruption in test_cpu_offload_gptq, confirming that premature release of cpu_data was the cause.

Why it is now safe

Two fixes closed the root cause:

1. C++ lambda in csrc/cuda_view.cu (landed after #18298, "[Don't merge] Debug failing quantization test with input batch move"): the deleter now captures cpu_tensor by value, keeping it alive for the full lifetime of the UVA CUDA view:
   ```
   // before: no-op, does not hold cpu_tensor alive
   auto deleter = [](void*) {};
   // after: captures cpu_tensor, increments its C++ refcount
   [base = cpu_tensor](void*) {}
   ```
   The pinned memory cannot be returned to CachingHostAllocator as long as any UVA CUDA view is alive, regardless of Python-side GC of the parameter object.
2. device_loading_context re-offload in vllm/model_executor/model_loader/utils.py: detects parameters whose UVA offload flag was lost during process_weights_after_loading and re-creates the UVA view with the C++ lambda.
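The effect of capturing by value can be illustrated with a Python analogue. This is a sketch, not the C++ code: `FakePinnedTensor`, `FakeCudaView`, and the helper functions are hypothetical, and a default argument on the lambda stands in for the `[base = cpu_tensor]` capture. A weak reference is used only to observe when the tensor is actually released.

```python
import gc
import weakref


class FakePinnedTensor:
    """Toy stand-in for the pinned CPU tensor."""


class FakeCudaView:
    """Toy stand-in for the UVA CUDA view; it owns its deleter."""

    def __init__(self, deleter):
        self._deleter = deleter


def view_with_noop_deleter(cpu_tensor):
    # Analogue of [](void*) {}: the deleter captures nothing.
    return FakeCudaView(lambda _ptr: None)


def view_with_capturing_deleter(cpu_tensor):
    # Analogue of [base = cpu_tensor](void*) {}: the default argument
    # binds the tensor to the deleter, so the view keeps it alive.
    return FakeCudaView(lambda _ptr, base=cpu_tensor: None)


def survives_without_python_refs(make_view):
    cpu_tensor = FakePinnedTensor()
    probe = weakref.ref(cpu_tensor)
    view = make_view(cpu_tensor)
    del cpu_tensor  # models the parameter object being collected
    gc.collect()
    alive_while_view_exists = probe() is not None
    del view
    gc.collect()
    released_after_view = probe() is None
    return alive_while_view_exists, released_after_view


noop_result = survives_without_python_refs(view_with_noop_deleter)
capturing_result = survives_without_python_refs(view_with_capturing_deleter)
```

With the no-op deleter the tensor is gone while the view still exists; with the capturing deleter it stays alive until the view itself is destroyed, and only then is it released.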

PR #36461 confirmed the fix is complete by removing the offload + quantization reinit guard added in #18654.

This is not duplicating an existing PR

[v1] Support multiple KV cache groups in GPU model runner #17945 moved init to initialize_kv_cache but was reverted (Revert "[v1] Support multiple KV cache groups in GPU model runner (#17945)" #18459) due to the then-unknown root cause.
[v1] Redo "Support multiple KV cache groups in GPU model runner (#17945)" #18593 re-did the multi-KV-cache-group work while keeping the placeholder init in __init__.
[Bugfix] Fix cpu-offload-gb assertion with non-default block sizes #36461 removed the offload guard but did not complete the cleanup.
No open PR removes the placeholder init entirely.

Test commands

```
# Linters (all passed locally)
pre-commit run --files vllm/v1/worker/gpu_model_runner.py

# Unit tests
.venv/bin/python -m pytest tests/v1/worker/ -v
.venv/bin/python -m pytest tests/quantization/test_cpu_offload.py::test_cpu_offload_gptq -v
```

Results on RTX 4090: tests/v1/worker/ 100 passed, 1 skipped; test_cpu_offload_gptq passed (directly exercises the root cause scenario).

AI assistance was used (Claude) to identify the root cause and draft this change. Every changed line has been reviewed by the human submitter.

🤖 Generated with Claude Code

Move InputBatch creation from GPUModelRunner.__init__ to initialize_kv_cache (via a new initialize_input_batch method), so it is built with the final block sizes from kv_cache_config rather than a placeholder.

The original early initialization was a workaround for a UVA pinned-memory reuse bug (see vllm-project#18298): GPTQ's process_weights_after_loading replaced parameter objects, causing the old PackedvLLMParameter (which held the only Python reference to cpu_data) to be GC'd and its pinned memory returned to CachingHostAllocator. InputBatch, if created after load_model, would then reuse that memory for block_table_cpu, aliasing live GPTQ weight CUDA views.

This is now safe because the C++ lambda in csrc/cuda_view.cu captures cpu_tensor by value ([base = cpu_tensor](void*){}), keeping it alive for the lifetime of the UVA CUDA view regardless of Python-side GC. PR vllm-project#36461 confirmed this by removing the offload+quantization reinit guard added in vllm-project#18654.

The may_reinitialize_input_batch method is renamed to initialize_input_batch and the conditional block-size comparison is dropped: InputBatch is always created fresh in initialize_kv_cache. This also fixes a latent bug where cp_kv_cache_interleave_size was omitted from the reinit path.

Co-authored-by: Claude
Signed-off-by: liwenyi <liwenyi199111@gmail.com>
Signed-off-by: liwenyi <lwy.lwy@163.com>

wenyili requested a review from njhill as a code owner June 13, 2026 13:57

mergify Bot added the v1 label Jun 13, 2026

wenyili force-pushed the init-input-batch-in-initialize-kv-cache branch from cacd7d3 to ab72d84 Compare June 13, 2026 16:03

wenyili wants to merge 1 commit into vllm-project:main from wenyili:init-input-batch-in-initialize-kv-cache
