[Draft]: v0.15.1 by gmorgachev · Pull Request #24 · gonka-ai/vllm

gmorgachev · 2026-04-04T00:25:55Z

PoC v2 + Enforced Sampling — port to vLLM 0.15.1 (V1 engine)

Port of PoC v2 (Proof of Computation) and Enforced Sampling from vLLM 0.9.1 V0 engine (gm/poc-layer-exp) to vLLM 0.15.1 V1 engine.

What's done

Enforced Sampling ported to vLLM 0.15.1 to enable inference validation (exact token sequence replay via enforced_token_ids)
PoC v2 fully ported: layer hooks, GPU random generation, callback queue, generate queue, validation
Conducted extensive cross-validation experiments between PoC v2 and Inference to verify correctness
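
As a rough illustration of the enforced-sampling idea (not the code in this PR), the sketch below replays a fixed token sequence: at each step the sampler's own choice is ignored, the enforced token is emitted, and that token's log-probability under the current distribution is recorded for later comparison. Only enforced_token_ids is a name taken from the PR; log_softmax, replay_enforced and the toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one decoding step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def replay_enforced(step_logits: Sequence[Sequence[float]],
                    enforced_token_ids: Sequence[int]) -> List[float]:
    """Return the logprob of each enforced token, step by step.

    A normal sampler would pick a token from each distribution; here the
    token is fixed in advance, so only its logprob is read out.
    """
    if len(step_logits) != len(enforced_token_ids):
        raise ValueError("one enforced token is needed per decoding step")
    logprobs = []
    for logits, token_id in zip(step_logits, enforced_token_ids):
        logprobs.append(log_softmax(logits)[token_id])
    return logprobs


# Toy example (hypothetical data): 2 steps over a 3-token vocabulary.
toy_logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 3.0]]
replayed = replay_enforced(toy_logits, enforced_token_ids=[1, 2])
```

A validator would then compare the replayed logprobs against the ones reported by the original inference, position by position.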

TODO

Some recent corner-case handling and integration mechanics are not yet implemented:

Chat priority gating (4ea4882) — reject inference requests while PoC generation is active to prevent NCCL deadlocks
Grammar graceful degradation (134609f) — disable grammar decoding when enforced tokens conflict with the grammar FSM, instead of failing repeatedly
Logprobs mode auto-detection — experiments showed that raw_logprobs perform significantly better than processed_logprobs for inference validation, while vLLM 0.9.1 defaults to processed. We need to classify the logprobs type during validation and switch to the correct mode automatically

Known Issues

Cross-GPU-generation validation, specifically Ampere vs Hopper/Blackwell, may produce logprob values beyond the match threshold, especially on long-context prompts. This means nodes running on A100 GPUs have a higher probability of inference invalidation compared to newer architectures. Switching to raw logprobs mode is expected to resolve this.

Usage

Building images

This branch can be used to build a Docker image, which serves as a base for the MLNode image.

vLLM image — build with Dockerfile.quick in the repo root, or use the prebuilt image.

MLNode image — build with this Dockerfile (updated base image + FlashAttention install), or use the prebuilt image.

Running the model

Inside the MLNode container, activate the environment and start the server:

. /app/packages/api/.venv/bin/activate \
  && cd /app/packages \
  && python -m uvicorn api.app:app --host 0.0.0.0 --port 8080

The vLLM model server is launched with:

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --dtype auto \
  --host 0.0.0.0 \
  --port 5001 \
  --tensor-parallel-size 4 \
  --max-model-len 240000 \
  --max-num-batched-tokens 32768 \
  --attention-backend FLASHINFER \
  --logprobs-mode processed_logprobs \
  --compilation-config '{"custom_ops": ["+quant_fp8", "+rms_norm", "+silu_and_mul", "+fused_moe", "+rotary_embedding", "+apply_rotary_emb", "none"]}'

Required environment variables:

export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_ALLOW_INSECURE_SERIALIZATION=1

Note: deployment setup is being cleaned up and will be simplified soon.

This PR and the computational experiments are joint work by @tamazgadaev, @baychak, @gmorgachev, @clanster, and @qdanik.

Port of PoC v2 (proof-of-computation) to vLLM 0.15.1 V1 engine.

Key changes vs V0 port:
- poc_model_runner.py: fully rewritten for V1 attention metadata
- engine_patch.py: monkey-patches AsyncLLM.poc_request via collective_rpc
- layer_hooks.py: ContextVar replaced with global bool for torch.compile
- gpu_random.py: batched random generation
- generate_queue.py: adapted for V1 request lifecycle

12 files in vllm/poc/

Enforced tokens enable exact token sequence replay for logprob validation.

- sampling_params.py: add enforced_token_ids field
- validation.py: new EnforcedToken/EnforcedTokens classes
- protocol.py, serving.py: wire enforced tokens into chat completion API
- gpu_input_batch.py: sampling support for enforced tokens in V1 worker

5 files (1 new + 4 patched)

Mount poc_router alongside main router and clear PoC queue on shutdown.

Python-only overlay on pre-built vLLM v0.15.1 image. Skips C extension compilation since csrc/ is unchanged. Build time: seconds instead of 30-60 minutes.

Made-with: Cursor

Tg/batch mising fix

- data.py: reject non-finite vectors as mismatch before L2 distance
- validation.py: skip non-finite received vectors, count as mismatch
- callbacks.py: drop payload after POC_CALLBACK_MAX_RETRIES (default 10)

Ported from upstream gonka-ai/vllm commits b8dea26 and e388951.
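
The non-finite guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in data.py: the function name vectors_match and the threshold parameter are assumptions. The point of checking first is that comparisons against NaN are always False, so a check written as "distance > threshold means mismatch" would silently let a NaN distance pass.

```python
import math
from typing import Sequence


def vectors_match(expected: Sequence[float],
                  received: Sequence[float],
                  threshold: float) -> bool:
    """Return True only if both vectors are finite and close in L2 distance."""
    if len(expected) != len(received):
        return False
    # NaN or Inf anywhere counts as a mismatch, decided before any distance
    # is computed so that a NaN distance can never be compared to the threshold.
    if not all(math.isfinite(x) for x in expected):
        return False
    if not all(math.isfinite(x) for x in received):
        return False
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(expected, received)))
    return dist <= threshold
```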

PoC v2 logic on this branch (vLLM 0.15.1 V1 engine) is equivalent to:
  repo: gonka-ai/vllm
  branch: gm/poc-state
  commit: e388951 (refactoring)
  tag: release/v0.9.1-pocv2-post6

Ported from upstream:
- NaN/Inf guards in data.py and validation.py (b8dea26)
- Callback max retries with drop (e388951)

V1-specific adaptations (not in upstream):
- engine_patch.py: monkey-patch for AsyncLLM
- poc_model_runner.py: V1 attention metadata, KV cache blocks
- layer_hooks.py: global bool instead of ContextVar (torch.compile compat)
- gpu_random.py: batched GPU operations (bit-identical output)

torch.compile traces input_ids.size() during CUDA graph capture, which crashes with NoneType when input_ids=None. Pass a zero tensor instead — model ignores input_ids when inputs_embeds is provided.

Port changes from upstream e388951:
- validation: add k_dim shape check for received vectors
- callbacks: add CallbackQueue with bounded concurrency
- generate_queue: RPC timeout handling, callback queue integration
- routes: add _poc_generation_active flag for chat endpoint
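
To make "bounded concurrency" and "drop after max retries" concrete, here is a small asyncio sketch. It is not the CallbackQueue from this PR: deliver_all, the send callable, and the parameter names are hypothetical, and real code would likely add backoff between attempts. The default of 10 retries mirrors the POC_CALLBACK_MAX_RETRIES default mentioned in the commits above.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical sender: returns True when the callback was delivered.
Sender = Callable[[dict], Awaitable[bool]]


async def deliver_all(payloads: List[dict],
                      send: Sender,
                      max_concurrency: int = 4,
                      max_retries: int = 10) -> List[dict]:
    """Send payloads with bounded concurrency; return the dropped ones."""
    semaphore = asyncio.Semaphore(max_concurrency)
    dropped: List[dict] = []

    async def deliver(payload: dict) -> None:
        async with semaphore:  # at most max_concurrency sends in flight
            for _ in range(max_retries):
                if await send(payload):
                    return
            dropped.append(payload)  # give up instead of retrying forever

    await asyncio.gather(*(deliver(p) for p in payloads))
    return dropped
```

Dropping after a fixed number of attempts keeps one unreachable callback target from occupying a concurrency slot indefinitely.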

fix(poc): pass dummy input_ids for CUDA graph compatibility

Replace per-nonce forward loop with a single batched model() call. Restores original gonka-source behavior (vLLM 0.9.1) where all nonces were processed together. Includes batch-aware attention metadata, NaN detection, and vectorized output stage.

feat(poc): batch all nonces in single forward pass

Upstream: gonka-ai/vllm branch gm/poc-state @ e388951
Our base: kaitakuai/vllm branch mb/poc-v015 @ 3428ed8

Parity covers: batched forward pass, callback queue, k_dim validation, RPC timeouts, generation-active flag.

V1-specific advantages retained: CommonAttentionMetadata with real KV cache blocks, torch.compile-compatible layer hooks (global bool vs ContextVar), GPU-vectorized random ops, two-level NaN detection.

This reverts commit 49aabe0, reversing changes made to 44cfa7c.


Add logprobs mode as request param + on-validation classifier

tamazgadaev · 2026-04-07T17:39:08Z

Default parameters baked

attention_backend=FLASHINFER, logprobs_mode=processed_logprobs, compilation_config='{"custom_ops": ["+quant_fp8", "+rms_norm", "+silu_and_mul", "+fused_moe", "+rotary_embedding", "+apply_rotary_emb", "none"]}' (for Qwen 235B), and max-num-batched-tokens=32768 are now baked into the default vLLM engine parameters.
All four can still be overridden via additional_args in the vLLM deployment, so anyone can, for example, choose their own attention backend. The values above were chosen for stability and good-enough inference and PoC performance.

With these defaults, the deployment command reduces to:

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --dtype auto \
  --tensor-parallel-size 4 \
  --max-model-len 240000

Logprobs switch merge

Adds a per-request logprobs_mode override (raw_logprobs / processed_logprobs) that can be set on each individual API request, overriding the deployment-level --logprobs-mode default. Requests without the override fall back to the deployment default as before.
Supports mixed batches: when requests with different logprobs modes land in the same batch, the sampler computes both raw and processed logprobs and merges per row. Homogeneous batches (all raw or all processed) take the same fast path as before with zero overhead.
Adds automatic logprobs mode detection for validation (enforced sampling) requests: when logprobs_mode is not explicitly set on the request but enforced_tokens are present, a lightweight classifier inspects the top_token_ids to determine whether the original inference used raw or processed logprobs, and sets the mode accordingly. This handles backward compatibility with older vLLM versions that didn't support per-request logprobs_mode.
The classifier works by counting the fraction of top-k token IDs below 4 (Qwen's lowest vocab IDs used as padding in processed mode). A ratio above 10% means processed, below means raw. Validated at 99.95% accuracy across 8,756 rows in prior experiments.
Priority chain: explicit logprobs_mode on the request > auto-detected mode from enforced token IDs > deployment-level --logprobs-mode default.
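
The classifier and the priority chain described above can be sketched as follows. The cutoff of 4 and the 10% ratio come from the description; the function names, the shape of top_token_ids (one list of top-k IDs per enforced position), and the choice to treat a ratio of exactly 10% as raw are assumptions made for illustration, not the PR's actual code.

```python
from typing import Optional, Sequence

LOW_ID_CUTOFF = 4       # token IDs 0..3, per the description above
PROCESSED_RATIO = 0.10  # a ratio above 10% is classified as processed


def classify_logprobs_mode(top_token_ids: Sequence[Sequence[int]]) -> str:
    """Guess the mode from the top-k token IDs of all enforced positions."""
    flat = [tid for row in top_token_ids for tid in row]
    if not flat:
        raise ValueError("no top-k token IDs to classify")
    low = sum(1 for tid in flat if tid < LOW_ID_CUTOFF)
    if low / len(flat) > PROCESSED_RATIO:
        return "processed_logprobs"
    return "raw_logprobs"


def resolve_logprobs_mode(request_mode: Optional[str],
                          top_token_ids: Optional[Sequence[Sequence[int]]],
                          deployment_default: str) -> str:
    """Explicit request mode > auto-detected mode > deployment default."""
    if request_mode is not None:
        return request_mode
    if top_token_ids and any(top_token_ids):
        return classify_logprobs_mode(top_token_ids)
    return deployment_default
```

For example, a validation request that carries enforced tokens but no explicit mode is classified from its token IDs, while a plain inference request falls through to the deployment default.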

Updated vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha1
Updated MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha1

What's left: chat priority gating, grammar graceful degradation

Tg/cornercases vllm15

tamazgadaev · 2026-04-09T21:30:29Z

UPD: The priority gating mechanism described below has a flaw: we should not drain in-flight inference requests, we should abort them. The fix is in the next commit and is described in the next comment.

The last merge implemented the chat priority gating and grammar graceful degradation mentioned above.

Summary

Adds chat priority gating for PoC: when PoC generation is active, the chat completion endpoint returns 503 (service_unavailable) to prevent GPU contention and NCCL deadlocks. The flag is set before the generation task is created and cleared via an add_done_callback on the task, covering normal completion, cancellation, and failure.
Drains in-flight inference before PoC GPU work: poc_request now waits for all unfinished requests to complete before issuing collective_rpc. This is required because execute_poc_forward reuses KV-cache blocks starting from block 0 — any overlap with live inference permanently corrupts model output. The drain is bounded by timeout_ms; on timeout, PoC is skipped with {"skipped": True}.
Removes attention metadata caching in poc_model_runner: the metadata builder's internal state (workspace buffers, page-table references) is mutated by every inference engine step, so reusing stale metadata caused the attention backend to write only a fraction of the expected KV entries, producing all-NaN hidden states. Rebuilding costs <1 ms vs ~15 ms for the model forward.
Adds grammar graceful degradation in xgrammar backend: when a grammar FSM rejects a token (e.g. enforced tokens from validation replay that conflict with the JSON schema), grammar enforcement is disabled for the remainder of that request instead of crashing. The _grammar_failed flag short-circuits accept_tokens, rollback, and fill_bitmask; it resets on reset().
Bakes deployment defaults into vLLM config: FlashInfer attention backend via --attention-backend, processed_logprobs as the default --logprobs-mode, max_num_batched_tokens=32768 for the OpenAI API server, and a Qwen3-MoE-specific custom_ops compilation preset. Also forces custom_ops=all unconditionally instead of branching on inductor/compilation mode.
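
The grammar graceful degradation item in the summary above can be pictured with the toy wrapper below. It is a sketch under simplifying assumptions, not the xgrammar backend code: the grammar is reduced to a stateless per-token predicate, so rollback is omitted, and the class name DegradingGrammar is hypothetical. Only the _grammar_failed flag and the method names accept_tokens, fill_bitmask and reset come from the description.

```python
from typing import Callable, List


class DegradingGrammar:
    """Wraps a token-acceptance check and switches itself off on rejection."""

    def __init__(self, accepts: Callable[[int], bool]) -> None:
        self._accepts = accepts  # stand-in for the real grammar FSM
        self._grammar_failed = False

    def accept_tokens(self, tokens: List[int]) -> bool:
        # Once degraded, skip the grammar entirely; otherwise a rejected
        # token flips the flag instead of raising or failing the request.
        if not self._grammar_failed and not all(self._accepts(t) for t in tokens):
            self._grammar_failed = True
        return True

    def fill_bitmask(self, vocab_size: int) -> List[bool]:
        if self._grammar_failed:
            return [True] * vocab_size  # no masking once degraded
        return [self._accepts(t) for t in range(vocab_size)]

    def reset(self) -> None:
        self._grammar_failed = False  # a fresh request enforces grammar again
```

The trade-off is that output after the first conflict is no longer guaranteed to satisfy the grammar, which is acceptable for validation replay where the token sequence is fixed in advance anyway.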

Updated images

vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha2
MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha2

tamazgadaev · 2026-04-09T21:59:12Z

Added tests/gonka with live and unit tests for several important scenarios. This does not affect vLLM behavior.

Made-with: Cursor

tamazgadaev · 2026-04-09T23:38:47Z

PoC priority: abort in-flight inference instead of draining

Previously, when PoC started, poc_request waited for all in-flight inference to finish naturally (drain). This delayed PoC by however long the current generation took to complete.

Now poc_request aborts all in-flight requests immediately via self.abort() before running collective_rpc. Also added the 503 guard to the /v1/completions endpoint (was already on /v1/chat/completions).
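
A minimal asyncio sketch of the abort-based gating, under stated assumptions: PocGate and its methods are hypothetical, in-flight inference is modeled as plain asyncio tasks that get cancelled (the PR uses self.abort() on the engine instead), and the PoC work is an arbitrary coroutine rather than collective_rpc. It shows the three pieces described in this thread: the flag is set before the PoC task is created, in-flight work is aborted rather than drained, and the flag is cleared by a done callback so that completion, cancellation, and failure are all covered.

```python
import asyncio
from typing import Any, Coroutine, Set


class PocGate:
    """Tracks in-flight inference and whether PoC generation is active."""

    def __init__(self) -> None:
        self.poc_active = False
        self.in_flight: Set[asyncio.Task] = set()

    def status_for_new_request(self) -> int:
        # Both completion endpoints would consult this before admitting work.
        return 503 if self.poc_active else 200

    def track(self, task: asyncio.Task) -> None:
        self.in_flight.add(task)
        task.add_done_callback(self.in_flight.discard)

    async def run_poc(self, poc_coro: Coroutine[Any, Any, Any]) -> Any:
        self.poc_active = True  # set before the PoC task exists
        pending = list(self.in_flight)
        for task in pending:
            task.cancel()  # abort immediately, do not wait for completion
        await asyncio.gather(*pending, return_exceptions=True)
        poc_task = asyncio.create_task(poc_coro)
        # Cleared on normal completion, cancellation, and failure alike.
        poc_task.add_done_callback(lambda _task: self._clear())
        return await poc_task

    def _clear(self) -> None:
        self.poc_active = False
```

Compared with draining, the PoC start time no longer depends on how long the current generations would have taken to finish; the cost is that aborted requests must be retried by the client.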

New images

vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha3
MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha3

feat: stronger seed

tamazgadaev · 2026-04-14T20:03:33Z

vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha5
MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha5

Updates:

Merge of feat: stronger seed #30 with stronger PoC Seed
Make default gpu_memory_utilization="0.925" to fit into 8xH100
Hardcode dtype='auto' into vllm

gmorgachev · 2026-04-19T16:09:36Z

Best result so far, using ghcr.io/product-science/mlnode:3.0.13-alpha5 on 8xH100:

export VLLM_ATTENTION_BACKEND=FLASHINFER
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export VLLM_ALLOW_INSECURE_SERIALIZATION=1
export POC_RPC_TIMEOUT_MS=300000
export POC_BATCH_SIZE_DEFAULT=16

args:

  "additional_args": [
    "--tensor-parallel-size", "4",
    "--max-num-batched-tokens", "65536",
    "--gpu-memory-utilization", "0.92",
    "--max-num-seqs", "128",
    "--max-model-len", "240000",
    "--enable-expert-parallel",
    "--disable-custom-all-reduce",
    "--num-gpu-blocks-override", "15000"
  ]

This configuration shows around 1295 nonces/min from each 4xH100 (2590 nonces/min total).

baychak and others added 27 commits February 18, 2026 10:58

Wire PoC router into OpenAI API server

c8b1d1a

Mount poc_router alongside main router and clear PoC queue on shutdown.

Add overlay Dockerfile for fast PoC builds

d78c265

Python-only overlay on pre-built vLLM v0.15.1 image. Skips C extension compilation since csrc/ is unchanged. Build time: seconds instead of 30-60 minutes.

token id fix

da02b4d

Made-with: Cursor

fix the batch mixing causing mismatches

269a682

Made-with: Cursor

Tg/batch mising fix

0b88099

Tg/batch mising fix

backend fix

69837a9

fix(poc): pass dummy input_ids for CUDA graph compatibility

48c09f5

torch.compile traces input_ids.size() during CUDA graph capture, which crashes with NoneType when input_ids=None. Pass a zero tensor instead — model ignores input_ids when inputs_embeds is provided.

Merge pull request #5 from kaitakuai/fix/poc-dummy-input-ids

49aabe0

fix(poc): pass dummy input_ids for CUDA graph compatibility

Merge pull request #6 from kaitakuai/feat/poc-batched-forward

3428ed8

feat(poc): batch all nonces in single forward pass

Revert "Merge pull request #5 from kaitakuai/fix/poc-dummy-input-ids"

2156692

This reverts commit 49aabe0, reversing changes made to 44cfa7c.

skip compiled fix

a1a057c

fix compilation skip

2abe4f9

add scratchpad

de73476

add scratchpad

cf2417a

add logprobs mode as request param + on-validation classifier

31f6c31

remove unused env variables

fb6fc7b

Attention backend, logprobs mode, compilation config and max num batched tokens default parameters baked into vllm

2182f84

Fix attention backend

fadde79

fix default args

0021c0c

Merge pull request #25 from gonka-ai/tg/logprobes_switch

591ff97

Add logprobs mode as request param + on-validation classifier

tamazgadaev added 2 commits April 9, 2026 14:17

Fix cornercases together with attention metadata

300267e

default to all

f71fb45

Merge pull request #27 from gonka-ai/tg/cornercases_vllm15

6f0a305

Tg/cornercases vllm15

tcharchian added this to Upgrade v0.2.12 Apr 9, 2026

github-project-automation Bot moved this to Todo in Upgrade v0.2.12 Apr 9, 2026

tcharchian moved this from Todo to In Progress in Upgrade v0.2.12 Apr 9, 2026

tcharchian mentioned this pull request Apr 9, 2026

[P0] vLLM 0.15.1 gonka-ai/gonka#939

Closed

tamazgadaev added 2 commits April 9, 2026 21:57

Add test suite

e11fe3f

update test

509ce27

Update priority gating drain -> abort

28ff0ba

Made-with: Cursor

tamazgadaev and others added 5 commits April 10, 2026 12:55

hardcode dtype auto

80dadb5

feat: stronger seed

bf45627

Merge pull request #30 from gonka-ai/fi/seed-015

2163850

feat: stronger seed

GPU util 0.925 by default

517d056

fix

dd1ddce

SegovChik mentioned this pull request Apr 27, 2026

perf: compile apply_householder with torch.compile (+10-12% PoC nonces/min) #36

Open

4 tasks

vLLM modification

495cf03


[Draft]: v0.15.1 #24

gmorgachev wants to merge 39 commits into release/v0.15.1 from tg/scratchpad_for_mode


5 participants
