
[v1] Initialize InputBatch in initialize_kv_cache instead of __init__ #45528

Open
wenyili wants to merge 1 commit into vllm-project:main from wenyili:init-input-batch-in-initialize-kv-cache

@wenyili wenyili commented Jun 13, 2026

Contributor

Summary

  • Moves InputBatch creation from GPUModelRunner.__init__ to initialize_kv_cache (via the new initialize_input_batch method), so it is built with the final block sizes from kv_cache_config rather than a placeholder value.
  • Removes the now-unnecessary _init_block_sizes / _init_kernel_block_sizes tracking fields and the conditional re-init logic in the old may_reinitialize_input_batch.
  • Fixes a latent bug where cp_kv_cache_interleave_size was omitted from the re-init path.
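
The reordering described above can be sketched with toy classes. This is a minimal illustration, not the vLLM code: ToyModelRunner, ToyInputBatch, and ToyKVCacheConfig are hypothetical stand-ins, and only the method names initialize_kv_cache and initialize_input_batch are taken from this PR.

```python
from dataclasses import dataclass


@dataclass
class ToyKVCacheConfig:
    """Hypothetical stand-in for kv_cache_config; carries the final block sizes."""
    block_sizes: list[int]


@dataclass
class ToyInputBatch:
    """Hypothetical stand-in for InputBatch; records the block sizes it was built with."""
    block_sizes: list[int]


class ToyModelRunner:
    """Hypothetical stand-in for GPUModelRunner, reduced to the init ordering."""

    def __init__(self) -> None:
        # No placeholder batch is created here any more.
        self.input_batch = None

    def initialize_input_batch(self, kv_cache_config: ToyKVCacheConfig) -> None:
        # Always build a fresh batch from the final block sizes.
        self.input_batch = ToyInputBatch(block_sizes=list(kv_cache_config.block_sizes))

    def initialize_kv_cache(self, kv_cache_config: ToyKVCacheConfig) -> None:
        self.initialize_input_batch(kv_cache_config)


runner = ToyModelRunner()
batch_before = runner.input_batch
runner.initialize_kv_cache(ToyKVCacheConfig(block_sizes=[16, 32]))
batch_after = runner.input_batch
```

In this sketch there is nothing to compare against a placeholder, so the conditional re-init path disappears entirely.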

Background

The early initialization in __init__ was a workaround for the UVA pinned-memory reuse bug reported in #18298. The three-step failure chain was:

  1. CPU offloading (#15354) kept cpu_data alive only through a Python attribute on the parameter object. The UVA CUDA view itself was created with a no-op C++ deleter, so it held no reference to the CPU tensor:
    p._vllm_offloaded_cpu_data = cpu_data   # sole Python reference to pinned memory
    p.data = get_cuda_view_from_cpu_tensor(cpu_data)
  2. process_weights_after_loading (GPTQ, Marlin, FP8, …) replaced parameter objects; the old PackedvLLMParameter was GC'd, releasing cpu_data back to CachingHostAllocator.
  3. InputBatch, if created after load_model, allocated pinned buffers (e.g. block_table_cpu) that reused that freed memory, aliasing live GPTQ weight CUDA views.
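
Step 3 of the chain (a freed block being handed out again while a view still points at it) can be reproduced in pure Python. This is a rough sketch under stated assumptions: ToyCachingAllocator is a hypothetical stand-in for CachingHostAllocator, a bytearray stands in for pinned memory, a memoryview stands in for the UVA CUDA view, and the explicit release() call stands in for the garbage collection of the old parameter object.

```python
class ToyCachingAllocator:
    """Hypothetical caching allocator: released blocks are reused, not destroyed."""

    def __init__(self) -> None:
        self._free_blocks: list[bytearray] = []

    def allocate(self, size: int) -> bytearray:
        # Prefer a cached block of the right size over a fresh allocation.
        for i, block in enumerate(self._free_blocks):
            if len(block) == size:
                return self._free_blocks.pop(i)
        return bytearray(size)

    def release(self, block: bytearray) -> None:
        self._free_blocks.append(block)


allocator = ToyCachingAllocator()

# Offloaded weight: the buffer plus a view that aliases it.
cpu_data = allocator.allocate(8)
cpu_data[:] = b"WEIGHTS!"
weight_view = memoryview(cpu_data)

# Stand-in for the old parameter being GC'd: the block returns to the cache
# even though weight_view is still in use.
allocator.release(cpu_data)

# A later allocation of the same size (think block_table_cpu) gets the same block.
block_table_cpu = allocator.allocate(8)
for i in range(len(block_table_cpu)):
    block_table_cpu[i] = 0

# The weight view now observes the block table's zeros instead of the weights.
weights_seen_by_view = bytes(weight_view)
```

Here block_table_cpu and cpu_data are the same object, which is the aliasing the description refers to.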

The workaround was to create InputBatch before load_model with a placeholder block size (#17945 / #18298), then conditionally re-create it in initialize_kv_cache if block sizes differed (#18593 / #18654). The root cause was described at the time as "unknown reasons".

I confirmed the root cause by temporarily keeping cpu_data alive on the module instead of the parameter:

# vllm/model_executor/models/utils.py  (temporary patch to confirm root cause)
if not hasattr(module, '_vllm_offloaded_cpu_datas'):
    module._vllm_offloaded_cpu_datas = []
module._vllm_offloaded_cpu_datas.append(cpu_data)  # survives parameter replacement
p.data = get_cuda_view_from_cpu_tensor(cpu_data)

This eliminated the corruption in test_cpu_offload_gptq, confirming that premature release of cpu_data was the cause.
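
The effect of that temporary patch can be illustrated without PyTorch, using weak references to observe object lifetime. This is only a sketch: PinnedBuffer, Param, and Module are hypothetical stand-ins, and only the attribute names _vllm_offloaded_cpu_data and _vllm_offloaded_cpu_datas come from the snippets above.

```python
import gc
import weakref


class PinnedBuffer:
    """Hypothetical stand-in for the pinned CPU tensor (cpu_data)."""


class Param:
    """Hypothetical stand-in for a parameter object."""


class Module:
    """Hypothetical stand-in for a module holding one parameter."""

    def __init__(self) -> None:
        self.weight = Param()


def offload(module: Module, keep_on_module: bool) -> weakref.ref:
    cpu_data = PinnedBuffer()
    # Original behaviour: the parameter holds the only strong reference.
    module.weight._vllm_offloaded_cpu_data = cpu_data
    if keep_on_module:
        # Temporary patch: a second reference that lives on the module.
        if not hasattr(module, "_vllm_offloaded_cpu_datas"):
            module._vllm_offloaded_cpu_datas = []
        module._vllm_offloaded_cpu_datas.append(cpu_data)
    return weakref.ref(cpu_data)


def replace_parameter(module: Module) -> None:
    # Stand-in for process_weights_after_loading swapping the parameter object.
    module.weight = Param()
    gc.collect()


unpatched = Module()
unpatched_ref = offload(unpatched, keep_on_module=False)
replace_parameter(unpatched)

patched = Module()
patched_ref = offload(patched, keep_on_module=True)
replace_parameter(patched)

buffer_freed_without_patch = unpatched_ref() is None
buffer_freed_with_patch = patched_ref() is None
```

Without the module-level list the buffer is released as soon as the parameter is replaced; with it, the buffer survives the replacement.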

Why it is now safe

Two fixes closed the root cause:

  1. C++ lambda in csrc/cuda_view.cu (landed after #18298, "[Don't merge] Debug failing quantization test with input batch move"): the deleter now captures cpu_tensor by value, keeping it alive for the full lifetime of the UVA CUDA view:
    // before: no-op, does not hold cpu_tensor alive
    auto deleter = [](void*) {};
    // after: captures cpu_tensor, increments its C++ refcount
    auto deleter = [base = cpu_tensor](void*) {};
    The pinned memory cannot be returned to CachingHostAllocator as long as any UVA CUDA view is alive — regardless of Python-side GC of the parameter object.
  2. device_loading_context re-offload in vllm/model_executor/model_loader/utils.py: detects parameters whose UVA offload flag was lost during process_weights_after_loading and re-creates the UVA view with the C++ lambda.
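
The lifetime effect of the capturing deleter in fix 1 has a close Python analogy: a callback that holds a reference to the buffer keeps it alive for as long as the callback itself is alive. The sketch below is an analogy only, not the C++ code; PinnedBuffer, CudaView, and both factory functions are hypothetical names.

```python
import gc
import weakref


class PinnedBuffer:
    """Hypothetical stand-in for the pinned CPU tensor."""


class CudaView:
    """Hypothetical stand-in for the UVA CUDA view; it owns only its deleter."""

    def __init__(self, deleter) -> None:
        self._deleter = deleter


def make_view_noop(cpu_tensor: PinnedBuffer) -> CudaView:
    # Analogue of the old deleter: it does not reference cpu_tensor at all.
    return CudaView(lambda ptr: None)


def make_view_capturing(cpu_tensor: PinnedBuffer) -> CudaView:
    # Analogue of [base = cpu_tensor]: the default argument holds a strong reference.
    return CudaView(lambda ptr, base=cpu_tensor: None)


def buffer_survives(make_view) -> bool:
    cpu_tensor = PinnedBuffer()
    ref = weakref.ref(cpu_tensor)
    view = make_view(cpu_tensor)
    # Drop the last Python-side reference, as parameter replacement would.
    del cpu_tensor
    gc.collect()
    alive = ref() is not None
    del view
    return alive


alive_with_noop_deleter = buffer_survives(make_view_noop)
alive_with_capturing_deleter = buffer_survives(make_view_capturing)
```

With the capturing variant, the buffer's lifetime is tied to the view rather than to whichever Python object happened to hold cpu_data.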

PR #36461 confirmed the fix is complete by removing the offload + quantization reinit guard added in #18654.

This is not duplicating an existing PR

Test commands

# Linters (all passed locally)
pre-commit run --files vllm/v1/worker/gpu_model_runner.py

# Unit tests
.venv/bin/python -m pytest tests/v1/worker/ -v
.venv/bin/python -m pytest tests/quantization/test_cpu_offload.py::test_cpu_offload_gptq -v

Results on RTX 4090: tests/v1/worker/ 100 passed, 1 skipped; test_cpu_offload_gptq passed (directly exercises the root cause scenario).

AI assistance was used (Claude) to identify the root cause and draft this change. Every changed line has been reviewed by the human submitter.

🤖 Generated with Claude Code

@wenyili wenyili requested a review from njhill as a code owner June 13, 2026 13:57
@mergify mergify Bot added the v1 label Jun 13, 2026
Move InputBatch creation from GPUModelRunner.__init__ to
initialize_kv_cache (via a new initialize_input_batch method), so it is
built with the final block sizes from kv_cache_config rather than a
placeholder.

The original early initialization was a workaround for a UVA
pinned-memory reuse bug (see vllm-project#18298): GPTQ's process_weights_after_loading
replaced parameter objects, causing the old PackedvLLMParameter (which
held the only Python reference to cpu_data) to be GC'd and its pinned
memory returned to CachingHostAllocator. InputBatch, if created after
load_model, would then reuse that memory for block_table_cpu, aliasing
live GPTQ weight CUDA views.

This is now safe because the C++ lambda in csrc/cuda_view.cu captures
cpu_tensor by value ([base = cpu_tensor](void*){}), keeping it alive for
the lifetime of the UVA CUDA view regardless of Python-side GC. PR vllm-project#36461
confirmed this by removing the offload+quantization reinit guard added in
vllm-project#18654.

The may_reinitialize_input_batch method is renamed to
initialize_input_batch and the conditional block-size comparison is
dropped — InputBatch is always created fresh in initialize_kv_cache.
This also fixes a latent bug where cp_kv_cache_interleave_size was
omitted from the reinit path.

Co-authored-by: Claude
Signed-off-by: liwenyi <liwenyi199111@gmail.com>

Signed-off-by: liwenyi <lwy.lwy@163.com>
@wenyili wenyili force-pushed the init-input-batch-in-initialize-kv-cache branch from cacd7d3 to ab72d84 Compare June 13, 2026 16:03