CUDA: scale q8->f16 cache reserve on >=112 GiB cards (fixes session OOM on large models) #472

Open
slackarea wants to merge 1 commit into antirez:main from vcnngr:fix-cuda-q8f16-reserve

@slackarea

cuda_q8_f16_cache_reserve_bytes() returns a flat 512 MiB reserve once total VRAM >= 112 GiB, instead of the 5% / 4 GiB-min rule used below that. The q8->f16 dequant cache is eager and fills HBM down to the reserve, so on a large model the session/context graph allocated after model load OOMs at session creation even though the weights themselves fit. DS4_CUDA_WEIGHT_CACHE_LIMIT_GB does not bound this cache, and loading an MTP model disables it and hides the issue.

Fix: drop the >= 112 GiB special case so every card uses 5% / 4 GiB-min. This is the CUDA twin of #446 (same bug on the ROCm runtime, q4q2).
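The two reserve rules described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names `reserve_scaled` and `reserve_with_flat_case` are hypothetical, and the sketch assumes "5% / 4 GiB-min" means 5% of total VRAM with a 4 GiB floor.

```c
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)
#define MIB (1024ULL * 1024ULL)

/* Hypothetical name. Rule applied to every card after this change:
 * 5% of total VRAM, but never less than 4 GiB. */
static uint64_t reserve_scaled(uint64_t total_bytes) {
    uint64_t reserve = total_bytes / 20; /* 5% of total */
    return reserve < 4 * GIB ? 4 * GIB : reserve;
}

/* Hypothetical name. Behavior described as the bug: a flat 512 MiB
 * reserve once total VRAM reaches 112 GiB, the scaled rule below that. */
static uint64_t reserve_with_flat_case(uint64_t total_bytes) {
    if (total_bytes >= 112 * GIB) {
        return 512 * MIB;
    }
    return reserve_scaled(total_bytes);
}
```

Under these assumptions a card reporting 140 GiB gets a 7 GiB reserve from `reserve_scaled` and only 512 MiB from `reserve_with_flat_case`, which is in line with the ~6.99 GiB and 0.50 GiB reserves reported in the evidence.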

Evidence (2x H200 NVL, 143 GB, no NVLink, CUDA 13.2)

A ~252 GB model split across 2 GPUs (--role coordinator --layers 0:36 + --role worker --layers 37:output), no SSD streaming:

  • Before (flat 512 MiB reserve): handshake and route succeed, then session creation fails with "CUDA tensor alloc failed: out of memory" (cached=14.37 GiB free=0.54 GiB reserve=0.50 GiB).
  • After (this PR, 5% = ~6.99 GiB reserve): cached=7.87 GiB free=7.04 GiB reserve=6.99 GiB, session creates, runs fully resident — prefill 39 t/s, generation 16 t/s.
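In both runs cached plus free adds up to the same headroom (14.37 + 0.54 = 7.87 + 7.04 = 14.91 GiB), so the reserve alone decides how much of it the eager cache takes. A simplified model of that behavior, assuming the cache grows until free memory reaches the reserve (the function name `eager_cache_budget` is hypothetical and allocation granularity is ignored):

```c
#include <stdint.h>

#define GIB (1024ULL * 1024ULL * 1024ULL)

/* Hypothetical helper: bytes an eager cache may claim if it fills
 * free memory down to `reserve`. Returns 0 when the memory left after
 * the weights is already at or below the reserve. */
static uint64_t eager_cache_budget(uint64_t free_after_weights,
                                   uint64_t reserve) {
    return free_after_weights > reserve ? free_after_weights - reserve : 0;
}
```

With 14.91 GiB of headroom this model gives roughly 14.41 GiB of cache for a 0.50 GiB reserve and roughly 7.92 GiB for a 6.99 GiB reserve, close to the reported 14.37 GiB and 7.87 GiB. Whatever remains as free memory is all the session/context graph has to allocate from.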

Non-regression (DeepSeek-V4 Flash, single H200, IQ2XXS)

Same build: prefill 97.0 t/s, generation 40.8 t/s (matches baseline). The change only enlarges the optional cache's reserve on >= 112 GiB cards; correctness is unaffected (logits identical — the cache is a dequant acceleration path), and on a single card a small model leaves plenty of free VRAM so the reserve does not bind.

🤖 Generated with Claude Code

cuda_q8_f16_cache_reserve_bytes() returned a flat 512 MiB reserve once
total VRAM >= 112 GiB, instead of the 5% / 4 GiB-min rule used below that.
The q8->f16 dequant cache is eager and fills HBM down to the reserve, so on
a large model the session/context graph allocated after model load OOMs at
session creation even though the weights themselves fit. WEIGHT_CACHE_LIMIT_GB
does not bound this cache, and loading an MTP model disables it and hides
the issue.

Drop the >=112 GiB special case so every card uses 5% / 4 GiB-min. This is
the CUDA twin of antirez#446 (same bug on the ROCm runtime, q4q2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AQVgY7rXrksjtBjPFSCnMH