[Draft]: v0.15.1#24
Port of PoC v2 (proof-of-computation) to vLLM 0.15.1 V1 engine.
Key changes vs V0 port:
- poc_model_runner.py fully rewritten for V1 attention metadata
- engine_patch.py: monkey-patches AsyncLLM.poc_request via collective_rpc
- layer_hooks.py: ContextVar replaced with global bool for torch.compile
- gpu_random.py: batched random generation
- generate_queue.py: adapted for V1 request lifecycle
12 files in vllm/poc/
Enforced tokens enable exact token sequence replay for logprob validation.
- sampling_params.py: add enforced_token_ids field
- validation.py: new EnforcedToken/EnforcedTokens classes
- protocol.py, serving.py: wire enforced tokens into chat completion API
- gpu_input_batch.py: sampling support for enforced tokens in V1 worker
5 files (1 new + 4 patched)
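As a rough illustration of what "exact token sequence replay" means, here is a minimal sketch in plain Python. It assumes the validator has per-step logits and a list of enforced token ids; the function names `log_softmax` and `replay_enforced` are hypothetical and are not the classes added in validation.py or the sampling path in gpu_input_batch.py.

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def replay_enforced(
    step_logits: Sequence[Sequence[float]],
    enforced_token_ids: Sequence[int],
) -> List[Tuple[int, float]]:
    """Emit the enforced token at every step instead of sampling.

    Returns (token_id, logprob) pairs so the logprobs can be compared
    against the ones reported by the original generation.
    """
    out = []
    for logits, token_id in zip(step_logits, enforced_token_ids):
        logprobs = log_softmax(logits)
        # The sampler's own choice is ignored; the enforced id wins.
        out.append((token_id, logprobs[token_id]))
    return out
```

Because the token at each position is fixed, two nodes replaying the same sequence should differ only in the logprob values, which is what the validation step compares.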
Mount poc_router alongside main router and clear PoC queue on shutdown.
Python-only overlay on pre-built vLLM v0.15.1 image. Skips C extension compilation since csrc/ is unchanged. Build time: seconds instead of 30-60 minutes.
Made-with: Cursor
Tg/batch mising fix
PoC v2 logic on this branch (vLLM 0.15.1 V1 engine) is equivalent to:
repo: gonka-ai/vllm
branch: gm/poc-state
commit: e388951 (refactoring)
tag: release/v0.9.1-pocv2-post6
Ported from upstream:
- NaN/Inf guards in data.py and validation.py (b8dea26)
- Callback max retries with drop (e388951)
V1-specific adaptations (not in upstream):
- engine_patch.py: monkey-patch for AsyncLLM
- poc_model_runner.py: V1 attention metadata, KV cache blocks
- layer_hooks.py: global bool instead of ContextVar (torch.compile compat)
- gpu_random.py: batched GPU operations (bit-identical output)
torch.compile traces input_ids.size() during CUDA graph capture, which crashes with NoneType when input_ids=None. Pass a zero tensor instead — model ignores input_ids when inputs_embeds is provided.
Port changes from upstream e388951:
- validation: add k_dim shape check for received vectors
- callbacks: add CallbackQueue with bounded concurrency
- generate_queue: RPC timeout handling, callback queue integration
- routes: add _poc_generation_active flag for chat endpoint
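For readers unfamiliar with the pattern, a callback queue with bounded concurrency and "max retries with drop" can be sketched with asyncio as below. This is an assumption-laden illustration, not the CallbackQueue in vllm/poc: the class name `BoundedCallbackQueue`, the `send` callable, and the default limits are all hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


class BoundedCallbackQueue:
    """Hypothetical sketch: cap in-flight callbacks, drop after max_retries."""

    def __init__(
        self,
        send: Callable[[Any], Awaitable[None]],
        max_concurrency: int = 2,
        max_retries: int = 3,
    ) -> None:
        self._send = send
        self._max_concurrency = max_concurrency
        self._max_retries = max_retries
        self.dropped: List[Any] = []

    async def _deliver(self, sem: asyncio.Semaphore, payload: Any) -> None:
        async with sem:  # at most max_concurrency sends run at once
            for _ in range(self._max_retries):
                try:
                    await self._send(payload)
                    return
                except Exception:
                    # Yield to the loop; a real queue would likely back off.
                    await asyncio.sleep(0)
            # All attempts failed: drop instead of retrying forever.
            self.dropped.append(payload)

    async def run(self, payloads: Sequence[Any]) -> None:
        sem = asyncio.Semaphore(self._max_concurrency)
        await asyncio.gather(*(self._deliver(sem, p) for p in payloads))
```

Dropping after a fixed number of attempts keeps one unreachable callback URL from blocking delivery of the remaining results.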
fix(poc): pass dummy input_ids for CUDA graph compatibility
Replace per-nonce forward loop with a single batched model() call. Restores original gonka-source behavior (vLLM 0.9.1) where all nonces were processed together. Includes batch-aware attention metadata, NaN detection, and vectorized output stage.
feat(poc): batch all nonces in single forward pass
Upstream: gonka-ai/vllm branch gm/poc-state @ e388951
Our base: kaitakuai/vllm branch mb/poc-v015 @ 3428ed8
Parity covers: batched forward pass, callback queue, k_dim validation, RPC timeouts, generation-active flag.
V1-specific advantages retained: CommonAttentionMetadata with real KV cache blocks, torch.compile-compatible layer hooks (global bool vs ContextVar), GPU-vectorized random ops, two-level NaN detection.
…hed tokens default parameters baked into vllm
Add logprobs mode as request param + on-validation classifier
Default parameters baked
Now the correct deployment is just Logprobs switch merge
Updated vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha1
What's left: chat priority gating, grammar graceful degradation
Tg/cornercases vllm15
UPD: The priority gating mechanism described below has a flaw: in-flight inferences should be aborted, not drained. The fix is in the next commit and is described in the next comment.
The last merge implemented the chat priority gating and grammar graceful degradation mentioned above.
Summary
Updated images
vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha2
Added tests/gonka with live and unit tests for some important scenarios. This does not affect vLLM's behavior.
PoC priority: abort in-flight inference instead of draining
Previously, when PoC started, poc_request waited for all in-flight inference to finish naturally (drain). This delayed PoC by however long the current generation took to complete.
Now poc_request aborts all in-flight requests immediately via self.abort() before running collective_rpc.
Also added the 503 guard to the /v1/completions endpoint (was already on /v1/chat/completions).
New images
vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha3
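The difference between draining and aborting, together with the 503 guard, can be illustrated with a small asyncio model. This is only a sketch under stated assumptions: `FakeEngine`, `submit`, and the `poc_active` flag are hypothetical stand-ins, not the AsyncLLM API or the patched routes, and the sleep is a placeholder for collective_rpc.

```python
import asyncio
from typing import Dict


class FakeEngine:
    """Hypothetical stand-in for the patched engine; not the vLLM API."""

    def __init__(self) -> None:
        self.in_flight: Dict[str, asyncio.Task] = {}
        self.poc_active = False

    async def _generate(self, request_id: str) -> None:
        await asyncio.sleep(3600)  # pretend this is a long generation

    def submit(self, request_id: str) -> int:
        # 503 guard: reject inference while PoC generation is active.
        if self.poc_active:
            return 503
        self.in_flight[request_id] = asyncio.create_task(
            self._generate(request_id)
        )
        return 200

    async def poc_request(self) -> str:
        self.poc_active = True
        try:
            # Abort instead of drain: cancel everything right away
            # rather than waiting for generations to finish.
            tasks = list(self.in_flight.values())
            for task in tasks:
                task.cancel()
            await asyncio.gather(*tasks, return_exceptions=True)
            self.in_flight.clear()
            await asyncio.sleep(0)  # placeholder for collective_rpc
            return "poc-done"
        finally:
            self.poc_active = False
```

With draining, `poc_request` would instead await the in-flight tasks without cancelling them, so PoC start time would depend on the longest running generation.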
vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha5
Updates:
Best result so far, with:
export VLLM_ATTENTION_BACKEND=FLASHINFER
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export VLLM_ALLOW_INSECURE_SERIALIZATION=1
export POC_RPC_TIMEOUT_MS=300000
export POC_BATCH_SIZE_DEFAULT=16
args:
This shows around 1295 nonces/min from each 4xH100 (2590 nonces/min total)
PoC v2 + Enforced Sampling — port to vLLM 0.15.1 (V1 engine)
Port of PoC v2 (Proof of Computation) and Enforced Sampling from vLLM 0.9.1 V0 engine (gm/poc-layer-exp) to vLLM 0.15.1 V1 engine.
What's done
enforced_token_ids)
TODO
Some recent corner-case handling and integration mechanics are not yet implemented:
- Chat priority gating (4ea4882): reject inference requests while PoC generation is active to prevent NCCL deadlocks
- Grammar graceful degradation (134609f): disable grammar decoding when enforced tokens conflict with the grammar FSM, instead of failing repeatedly
- Logprobs mode auto-detection: experiments showed that raw_logprobs perform significantly better than processed_logprobs for inference validation, while vLLM 0.9.1 defaults to processed. We need to classify the logprobs type during validation and switch to the correct mode automatically
Known Issues
Cross-GPU-generation validation, specifically Ampere vs Hopper/Blackwell, may produce logprob values beyond the match threshold, especially on long-context prompts. This means nodes running on A100 GPUs have a higher probability of inference invalidation compared to newer architectures. Switching to raw logprobs mode is expected to resolve this.
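One possible shape for the logprobs-type classification and threshold check discussed above is sketched below. It is an assumption, not the classifier in this branch: it supposes the validator can recompute both raw and processed logprobs for the same enforced tokens, and it picks whichever is closer by mean absolute difference. The function names and the threshold value in the usage are hypothetical.

```python
from typing import Sequence, Tuple


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length logprob lists."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify_logprobs_mode(
    received: Sequence[float],
    raw: Sequence[float],
    processed: Sequence[float],
) -> Tuple[str, float]:
    """Guess which mode produced `received` and return (mode, distance)."""
    d_raw = mean_abs_diff(received, raw)
    d_processed = mean_abs_diff(received, processed)
    if d_raw <= d_processed:
        return "raw_logprobs", d_raw
    return "processed_logprobs", d_processed


def is_within_threshold(
    received: Sequence[float],
    raw: Sequence[float],
    processed: Sequence[float],
    threshold: float,
) -> bool:
    """Validate against the closer mode using a caller-supplied threshold."""
    _, distance = classify_logprobs_mode(received, raw, processed)
    return distance <= threshold
```

For example, `classify_logprobs_mode([-1.0, -2.0], [-1.01, -2.01], [-0.5, -1.5])` selects `"raw_logprobs"`, since the raw values are much closer to the received ones.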
Usage
Building images
This branch can be used to build a Docker image, which serves as a base for the MLNode image.
- vLLM image: build with Dockerfile.quick in the repo root, or use the prebuilt image.
- MLNode image: build with this Dockerfile (updated base image + FlashAttention install), or use the prebuilt image.
Running the model
Inside the MLNode container, activate the environment and start the server:
The vLLM model server is launched with:
python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --dtype auto \
  --host 0.0.0.0 \
  --port 5001 \
  --tensor-parallel-size 4 \
  --max-model-len 240000 \
  --max-num-batched-tokens 32768 \
  --attention-backend FLASHINFER \
  --logprobs-mode processed_logprobs \
  --compilation-config '{"custom_ops": ["+quant_fp8", "+rms_norm", "+silu_and_mul", "+fused_moe", "+rotary_embedding", "+apply_rotary_emb", "none"]}'
Required environment variables:
This PR and the computational experiments are joint work by @tamazgadaev @baychak @gmorgachev @clanster @qdanik