[v1] Initialize InputBatch in initialize_kv_cache instead of __init__#45528
Open
wenyili wants to merge 1 commit into
Open
[v1] Initialize InputBatch in initialize_kv_cache instead of __init__#45528wenyili wants to merge 1 commit into
wenyili wants to merge 1 commit into
Conversation
Move InputBatch creation from GPUModelRunner.__init__ to initialize_kv_cache (via a new initialize_input_batch method), so it is built with the final block sizes from kv_cache_config rather than a placeholder. The original early initialization was a workaround for a UVA pinned-memory reuse bug (see vllm-project#18298): GPTQ's process_weights_after_loading replaced parameter objects, causing the old PackedvLLMParameter (which held the only Python reference to cpu_data) to be GC'd and its pinned memory returned to CachingHostAllocator. InputBatch, if created after load_model, would then reuse that memory for block_table_cpu, aliasing live GPTQ weight CUDA views. This is now safe because the C++ lambda in csrc/cuda_view.cu captures cpu_tensor by value ([base = cpu_tensor](void*){}), keeping it alive for the lifetime of the UVA CUDA view regardless of Python-side GC. PR vllm-project#36461 confirmed this by removing the offload+quantization reinit guard added in vllm-project#18654. The may_reinitialize_input_batch method is renamed to initialize_input_batch and the conditional block-size comparison is dropped — InputBatch is always created fresh in initialize_kv_cache. This also fixes a latent bug where cp_kv_cache_interleave_size was omitted from the reinit path. Co-authored-by: Claude Signed-off-by: liwenyi <liwenyi199111@gmail.com> Signed-off-by: liwenyi <lwy.lwy@163.com>
cacd7d3 to
ab72d84
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
InputBatchcreation fromGPUModelRunner.__init__toinitialize_kv_cache(via the newinitialize_input_batchmethod), so it is built with the final block sizes fromkv_cache_configrather than a placeholder value._init_block_sizes/_init_kernel_block_sizestracking fields and the conditional re-init logic in the oldmay_reinitialize_input_batch.cp_kv_cache_interleave_sizewas omitted from the re-init path.Background
The early initialization in
__init__was a workaround for the UVA pinned-memory reuse bug reported in #18298. The three-step failure chain was:#15354) storedcpu_dataonly as a Python attribute on the parameter object, backed by a no-op C++ deleter that did not hold a reference to the CPU tensor:process_weights_after_loading(GPTQ, Marlin, FP8, …) replaced parameter objects; the oldPackedvLLMParameterwas GC'd, releasingcpu_databack toCachingHostAllocator.InputBatch, if created afterload_model, allocated pinned buffers (e.g.block_table_cpu) that reused that freed memory, aliasing live GPTQ weight CUDA views.The workaround was to create
InputBatchbeforeload_modelwith a placeholder block size (#17945 / #18298), then conditionally re-create it ininitialize_kv_cacheif block sizes differed (#18593 / #18654). The root cause was described at the time as "unknown reasons".I confirmed the root cause by temporarily keeping
cpu_dataalive on the module instead of the parameter:This eliminated the corruption in
test_cpu_offload_gptq, confirming that premature release ofcpu_datawas the cause.Why it is now safe
Two fixes closed the root cause:
csrc/cuda_view.cu(landed after [Don't merge] Debug failing quantization test with input batch move #18298): the deleter now capturescpu_tensorby value, keeping it alive for the full lifetime of the UVA CUDA view:CachingHostAllocatoras long as any UVA CUDA view is alive — regardless of Python-side GC of the parameter object.device_loading_contextre-offload invllm/model_executor/model_loader/utils.py: detects parameters whose UVA offload flag was lost duringprocess_weights_after_loadingand re-creates the UVA view with the C++ lambda.PR #36461 confirmed the fix is complete by removing the
offload + quantizationreinit guard added in #18654.This is not duplicating an existing PR
initialize_kv_cachebut was reverted (Revert "[v1] Support multiple KV cache groups in GPU model runner (#17945) #18459) due to the then-unknown root cause.__init__.Test commands
Results on RTX 4090:
tests/v1/worker/100 passed, 1 skipped;test_cpu_offload_gptqpassed (directly exercises the root cause scenario).🤖 Generated with Claude Code