CUDA: scale q8->f16 cache reserve on >=112 GiB cards (fixes session OOM on large models) by slackarea · Pull Request #472 · antirez/ds4

slackarea · 2026-06-28T15:24:57Z

cuda_q8_f16_cache_reserve_bytes() returns a flat 512 MiB reserve once total VRAM >= 112 GiB, instead of the rule used below that threshold (5% of total VRAM, 4 GiB minimum). The q8->f16 dequant cache is eager and fills HBM down to the reserve. On a large model, the session/context graph, which is allocated after model load, therefore hits OOM at session creation even though the weights themselves fit. DS4_CUDA_WEIGHT_CACHE_LIMIT_GB does not bound this cache, and loading an MTP model disables the cache, which hides the issue.

Fix: drop the >= 112 GiB special case so every card uses 5% / 4 GiB-min. This is the CUDA twin of #446 (same bug on the ROCm runtime, q4q2).
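
A minimal sketch of the rule before and after this change, for illustration only. The function names below are hypothetical stand-ins and not the body of cuda_q8_f16_cache_reserve_bytes(); the sketch also assumes that "5% / 4 GiB-min" means 5% of total VRAM with a 4 GiB floor.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/* Assumed reading of "5% / 4 GiB-min": 5% of total VRAM, never below 4 GiB. */
static uint64_t reserve_scaled(uint64_t total_bytes) {
    uint64_t five_percent = total_bytes / 20;
    return five_percent < 4 * GIB ? 4 * GIB : five_percent;
}

/* Hypothetical model of the old behavior: flat 512 MiB at >= 112 GiB. */
static uint64_t reserve_bytes_old(uint64_t total_bytes) {
    if (total_bytes >= 112 * GIB) return 512 * MIB;
    return reserve_scaled(total_bytes);
}

/* Hypothetical model of the new behavior: the scaled rule on every card. */
static uint64_t reserve_bytes_new(uint64_t total_bytes) {
    return reserve_scaled(total_bytes);
}

/* Illustrative checks on round card sizes (not measured values). */
static void reserve_rule_selfcheck(void) {
    assert(reserve_bytes_new(24 * GIB) == 4 * GIB);                     /* floor applies */
    assert(reserve_bytes_old(80 * GIB) == reserve_bytes_new(80 * GIB)); /* unchanged below 112 GiB */
    assert(reserve_bytes_old(140 * GIB) == 512 * MIB);                  /* old flat reserve */
    assert(reserve_bytes_new(140 * GIB) == 7 * GIB);                    /* 5% of 140 GiB */
}
```

Under these assumptions, only cards at or above 112 GiB see a different reserve; smaller cards get the same value from both versions.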

Evidence (2x H200 NVL, 143 GB, no NVLink, CUDA 13.2)

A ~252 GB model split across 2 GPUs (--role coordinator --layers 0:36 + --role worker --layers 37:output), no SSD streaming:

Before (flat 512 MiB reserve): handshake and route succeed, then session creation fails with CUDA tensor alloc failed: out of memory (cached=14.37 GiB free=0.54 GiB reserve=0.50 GiB).
After (this PR, 5% = ~6.99 GiB reserve): cached=7.87 GiB free=7.04 GiB reserve=6.99 GiB; the session is created and runs fully resident at prefill 39 t/s, generation 16 t/s.
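
The two log lines are consistent with a simple reading of "fills HBM down to the reserve": cached + free is 14.91 GiB in both runs, and the cache takes roughly that amount minus the reserve. The sketch below checks this arithmetic. The budget model and the function names are assumptions made for illustration, not code from the runtime; the remaining gap of about 0.05 GiB may come from allocation granularity.

```c
#include <assert.h>

/* Hypothetical model: an eager cache keeps growing until free VRAM
 * has dropped to the reserve, so its budget is free-before minus reserve. */
static double cache_budget_gib(double free_before_gib, double reserve_gib) {
    double budget = free_before_gib - reserve_gib;
    return budget > 0.0 ? budget : 0.0;
}

/* Check the reported figures against the model. */
static void evidence_selfcheck(void) {
    /* VRAM available to the cache before it fills: cached + free. */
    double before_run = 14.37 + 0.54; /* flat 512 MiB reserve */
    double after_run  = 7.87 + 7.04;  /* 5% reserve */

    /* Both runs start from the same 14.91 GiB of headroom. */
    assert(before_run > 14.90 && before_run < 14.92);
    assert(after_run  > 14.90 && after_run  < 14.92);

    /* Predicted cache sizes sit within 0.1 GiB of the reported ones. */
    double old_budget = cache_budget_gib(before_run, 0.50); /* reported 14.37 */
    double new_budget = cache_budget_gib(after_run, 6.99);  /* reported 7.87 */
    assert(old_budget > 14.3 && old_budget < 14.5);
    assert(new_budget > 7.8 && new_budget < 8.0);
}
```

In both runs the VRAM left free is close to the reserve itself, so the reserve is effectively the whole allocation budget for the session graph: about 0.5 GiB before the change and about 7 GiB after.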

Non-regression (DeepSeek-V4 Flash, single H200, IQ2XXS)

Same build: prefill 97.0 t/s, generation 40.8 t/s (matches baseline). The change only enlarges the optional cache's reserve on >= 112 GiB cards. Correctness is unaffected: the cache is a dequant acceleration path, and logits are identical. On a single card, a small model leaves plenty of free VRAM, so the reserve does not bind.

🤖 Generated with Claude Code

cuda_q8_f16_cache_reserve_bytes() returned a flat 512 MiB reserve once total VRAM >= 112 GiB, instead of the 5% / 4 GiB-min rule used below that. The q8->f16 dequant cache is eager and fills HBM down to the reserve, so on a large model the session/context graph allocated after model load OOMs at session creation even though the weights themselves fit. WEIGHT_CACHE_LIMIT_GB does not bound this cache, and loading an MTP model disables it and hides the issue.

Drop the >=112 GiB special case so every card uses 5% / 4 GiB-min. This is the CUDA twin of antirez#446 (same bug on the ROCm runtime, q4q2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01AQVgY7rXrksjtBjPFSCnMH

slackarea mentioned this pull request Jun 28, 2026

GLM-5.2 (GlmMoeDsa, ~744B) runs on DS4 across 2x H200 — report + reusable findings #473

Open

slackarea wants to merge 1 commit into antirez:main from vcnngr:fix-cuda-q8f16-reserve