[Draft]: v0.15.1#24

Open
gmorgachev wants to merge 39 commits into release/v0.15.1 from tg/scratchpad_for_mode

Conversation

@gmorgachev

@gmorgachev gmorgachev commented Apr 4, 2026

PoC v2 + Enforced Sampling — port to vLLM 0.15.1 (V1 engine)

Port of PoC v2 (Proof of Computation) and Enforced Sampling from vLLM 0.9.1 V0 engine (gm/poc-layer-exp) to vLLM 0.15.1 V1 engine.

What's done

  • Enforced Sampling ported to vLLM 0.15.1 to enable inference validation (exact token sequence replay via enforced_token_ids)
  • PoC v2 fully ported: layer hooks, GPU random generation, callback queue, generate queue, validation
  • Conducted extensive cross-validation experiments between PoC v2 and Inference to verify correctness
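
The replay idea behind Enforced Sampling can be illustrated with a minimal plain-Python sketch. This is not the code from this branch: `replay_logprobs`, `log_softmax`, and the toy logits are hypothetical, and the only name taken from the PR is `enforced_token_ids`. At each step the validator ignores what the sampler would have picked, forces the recorded token, and reads that token's logprob from its own model so it can be compared with the value reported by the original inference.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def replay_logprobs(step_logits, enforced_token_ids):
    """Force the recorded token at each step and return its logprob.

    step_logits: one row of logits per decoding step (toy data here;
        in a real server these come from the validating model).
    enforced_token_ids: token sequence recorded from the original inference.
    """
    if len(step_logits) != len(enforced_token_ids):
        raise ValueError("need one logits row per enforced token")
    replayed = []
    for logits, token_id in zip(step_logits, enforced_token_ids):
        logprobs = log_softmax(logits)
        # Take the enforced token, even when it is not the argmax.
        replayed.append(logprobs[token_id])
    return replayed


# Hypothetical two-step example over a three-token vocabulary.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 3.0]]
replayed = replay_logprobs(toy_logits, enforced_token_ids=[1, 2])
```

A validator would then compare `replayed` against the logprobs attached to the original response and flag the inference if the difference exceeds its match threshold.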

TODO

Some recent corner-case handling and integration mechanics are not yet implemented:

  1. Chat priority gating (4ea4882) — reject inference requests while PoC generation is active to prevent NCCL deadlocks

  2. Grammar graceful degradation (134609f) — disable grammar decoding when enforced tokens conflict with the grammar FSM, instead of failing repeatedly

  3. Logprobs mode auto-detection — experiments showed that raw_logprobs performs significantly better than processed_logprobs for inference validation, whereas vLLM 0.9.1 defaults to processed. We need to classify the logprobs type during validation and switch to the correct mode automatically

Known Issues

Cross-GPU-generation validation, specifically Ampere vs Hopper/Blackwell, may produce logprob values beyond the match threshold, especially on long-context prompts. This means nodes running on A100 GPUs have a higher probability of inference invalidation compared to newer architectures. Switching to raw logprobs mode is expected to resolve this.

Usage

Building images

This branch can be used to build a Docker image, which serves as a base for the MLNode image.

vLLM image — build with Dockerfile.quick in the repo root, or use the prebuilt image.

MLNode image — build with this Dockerfile (updated base image + FlashAttention install), or use the prebuilt image.

Running the model

Inside the MLNode container, activate the environment and start the server:

. /app/packages/api/.venv/bin/activate \
  && cd /app/packages \
  && python -m uvicorn api.app:app --host 0.0.0.0 --port 8080

The vLLM model server is launched with:

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --dtype auto \
  --host 0.0.0.0 \
  --port 5001 \
  --tensor-parallel-size 4 \
  --max-model-len 240000 \
  --max-num-batched-tokens 32768 \
  --attention-backend FLASHINFER \
  --logprobs-mode processed_logprobs \
  --compilation-config '{"custom_ops": ["+quant_fp8", "+rms_norm", "+silu_and_mul", "+fused_moe", "+rotary_embedding", "+apply_rotary_emb", "none"]}'

Required environment variables:

export VLLM_ATTENTION_BACKEND=FLASHINFER
export VLLM_ALLOW_INSECURE_SERIALIZATION=1

Note: deployment setup is being cleaned up and will be simplified soon.


This PR and the computational experiments are joint work by @tamazgadaev @baychak @gmorgachev @clanster @qdanik

baychak and others added 27 commits February 18, 2026 10:58
Port of PoC v2 (proof-of-computation) to vLLM 0.15.1 V1 engine.
Key changes vs V0 port:
- poc_model_runner.py fully rewritten for V1 attention metadata
- engine_patch.py: monkey-patches AsyncLLM.poc_request via collective_rpc
- layer_hooks.py: ContextVar replaced with global bool for torch.compile
- gpu_random.py: batched random generation
- generate_queue.py: adapted for V1 request lifecycle

12 files in vllm/poc/
Enforced tokens enable exact token sequence replay for logprob validation.
- sampling_params.py: add enforced_token_ids field
- validation.py: new EnforcedToken/EnforcedTokens classes
- protocol.py, serving.py: wire enforced tokens into chat completion API
- gpu_input_batch.py: sampling support for enforced tokens in V1 worker

5 files (1 new + 4 patched)
Mount poc_router alongside main router and clear PoC queue on shutdown.
Python-only overlay on pre-built vLLM v0.15.1 image.
Skips C extension compilation since csrc/ is unchanged.
Build time: seconds instead of 30-60 minutes.
Made-with: Cursor
Tg/batch mising fix
- data.py: reject non-finite vectors as mismatch before L2 distance
- validation.py: skip non-finite received vectors, count as mismatch
- callbacks.py: drop payload after POC_CALLBACK_MAX_RETRIES (default 10)

Ported from upstream gonka-ai/vllm commit b8dea26 and e388951.
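
The non-finite guard described in the commit above can be sketched in a few lines of standard-library Python. This is an illustration only, not the contents of data.py or validation.py: `vectors_match` and `max_l2_distance` are hypothetical names, and the length check is an added assumption. The point is that a NaN or Inf in the received vector is reported as a mismatch explicitly, so the decision never depends on how non-finite values propagate through the distance computation.

```python
import math


def vectors_match(expected, received, max_l2_distance):
    """Return True only if `received` is finite and within the L2 threshold."""
    if len(received) != len(expected):
        return False  # assumption: a wrong-sized vector is also a mismatch
    if not all(math.isfinite(x) for x in received):
        return False  # NaN/Inf is a mismatch; the distance is never computed
    return math.dist(expected, received) <= max_l2_distance


# Hypothetical checks: a 3-4-5 triangle sits exactly on a threshold of 5.0.
on_threshold = vectors_match([0.0, 0.0], [3.0, 4.0], 5.0)
with_nan = vectors_match([0.0, 0.0], [float("nan"), 0.0], 5.0)
```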
PoC v2 logic on this branch (vLLM 0.15.1 V1 engine) is equivalent to:

  repo:   gonka-ai/vllm
  branch: gm/poc-state
  commit: e388951 (refactoring)
  tag:    release/v0.9.1-pocv2-post6

Ported from upstream:
- NaN/Inf guards in data.py and validation.py (b8dea26)
- Callback max retries with drop (e388951)

V1-specific adaptations (not in upstream):
- engine_patch.py: monkey-patch for AsyncLLM
- poc_model_runner.py: V1 attention metadata, KV cache blocks
- layer_hooks.py: global bool instead of ContextVar (torch.compile compat)
- gpu_random.py: batched GPU operations (bit-identical output)
torch.compile traces input_ids.size() during CUDA graph capture,
which crashes with NoneType when input_ids=None. Pass a zero tensor
instead — model ignores input_ids when inputs_embeds is provided.
Port changes from upstream e388951:
- validation: add k_dim shape check for received vectors
- callbacks: add CallbackQueue with bounded concurrency
- generate_queue: RPC timeout handling, callback queue integration
- routes: add _poc_generation_active flag for chat endpoint
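
For readers unfamiliar with the pattern, a callback queue with bounded concurrency and a retry cap can be sketched with asyncio as follows. This is not the `CallbackQueue` from callbacks.py: `deliver`, `deliver_all`, `send`, and `max_concurrency` are hypothetical, and treating the cap as a total attempt count is an assumption. Only the `POC_CALLBACK_MAX_RETRIES` variable and its default of 10 come from the commit messages above.

```python
import asyncio
import os

# Name and default taken from the commit message; how the real code reads it is unknown.
MAX_RETRIES = int(os.environ.get("POC_CALLBACK_MAX_RETRIES", "10"))


async def deliver(payload, send, max_retries=MAX_RETRIES):
    """Call send(payload) up to max_retries times; return False if dropped."""
    for _ in range(max_retries):
        try:
            await send(payload)
            return True
        except Exception:
            await asyncio.sleep(0)  # a real queue would back off here
    return False  # payload is dropped after the last failed attempt


async def deliver_all(payloads, send, max_concurrency=4, max_retries=MAX_RETRIES):
    """Deliver payloads with at most max_concurrency sends in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(payload):
        async with sem:
            return await deliver(payload, send, max_retries)

    return await asyncio.gather(*(one(p) for p in payloads))


async def demo():
    attempts = {}

    async def send(payload):
        # Toy endpoint: "flaky" fails twice, "always-fails" never succeeds.
        attempts[payload] = attempts.get(payload, 0) + 1
        if payload == "always-fails" or (payload == "flaky" and attempts[payload] < 3):
            raise ConnectionError("callback endpoint unavailable")

    results = await deliver_all(["ok", "flaky", "always-fails"], send, max_retries=5)
    return results, attempts


results, attempts = asyncio.run(demo())
```

The semaphore keeps a burst of failing callbacks from monopolizing the event loop, and the cap ensures an unreachable endpoint cannot hold a payload forever.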
fix(poc): pass dummy input_ids for CUDA graph compatibility
Replace per-nonce forward loop with a single batched model() call.
Restores original gonka-source behavior (vLLM 0.9.1) where all nonces
were processed together. Includes batch-aware attention metadata,
NaN detection, and vectorized output stage.
feat(poc): batch all nonces in single forward pass
Upstream: gonka-ai/vllm branch gm/poc-state @ e388951
Our base: kaitakuai/vllm branch mb/poc-v015 @ 3428ed8

Parity covers: batched forward pass, callback queue,
k_dim validation, RPC timeouts, generation-active flag.

V1-specific advantages retained: CommonAttentionMetadata
with real KV cache blocks, torch.compile-compatible layer
hooks (global bool vs ContextVar), GPU-vectorized random
ops, two-level NaN detection.
…hed tokens default parameters baked into vllm
Add logprobs mode as request param + on-validation classifier
@tamazgadaev

tamazgadaev commented Apr 7, 2026

Default parameters baked

  • attention_backend=FLASHINFER, logprobs_mode=processed_logprobs, compilation_config='{"custom_ops": ["+quant_fp8", "+rms_norm", "+silu_and_mul", "+fused_moe", "+rotary_embedding", "+apply_rotary_emb", "none"]}' (for Qwen 235B), and max-num-batched-tokens=32768 are now baked into the default vLLM engine parameters
  • All four can be overridden via additional_args in the vLLM deployment, so anyone can easily choose their own attention backend. The values above were chosen for stability and good-enough inference and PoC performance

With these defaults, the deployment command reduces to:

python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --dtype auto \
  --tensor-parallel-size 4 \
  --max-model-len 240000

Logprobs switch merge

  • Adds a per-request logprobs_mode override (raw_logprobs / processed_logprobs) that can be set on each individual API request, overriding the deployment-level --logprobs-mode default. Requests without the override fall back to the deployment default as before.
  • Supports mixed batches: when requests with different logprobs modes land in the same batch, the sampler computes both raw and processed logprobs and merges per row. Homogeneous batches (all raw or all processed) take the same fast path as before with zero overhead.
  • Adds automatic logprobs mode detection for validation (enforced sampling) requests: when logprobs_mode is not explicitly set on the request but enforced_tokens are present, a lightweight classifier inspects the top_token_ids to determine whether the original inference used raw or processed logprobs, and sets the mode accordingly. This handles backward compatibility with older vLLM versions that didn't support per-request logprobs_mode.
  • The classifier works by counting the fraction of top-k token IDs below 4 (Qwen's lowest vocab IDs used as padding in processed mode). A ratio above 10% means processed, below means raw. Validated at 99.95% accuracy across 8,756 rows in prior experiments.
  • Priority chain: explicit logprobs_mode on the request > auto-detected mode from enforced token IDs > deployment-level --logprobs-mode default.
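
The classifier and priority chain described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: the function names, the flat counting over all rows, and the handling of a ratio of exactly 10% (treated as raw here) are assumptions. The cutoff of 4, the 10% threshold, and the mode strings come from the description.

```python
from typing import Optional, Sequence

LOW_ID_CUTOFF = 4       # token IDs below this count as processed-mode padding
PROCESSED_RATIO = 0.10  # above this fraction, the request is called "processed"


def classify_logprobs_mode(top_token_ids: Sequence[Sequence[int]]) -> Optional[str]:
    """Guess the logprobs mode from the top-k token IDs of an enforced request."""
    flat = [t for row in top_token_ids for t in row]
    if not flat:
        return None  # nothing to inspect; let the caller fall back
    low = sum(1 for t in flat if t < LOW_ID_CUTOFF)
    # Assumption: a ratio of exactly 10% is classified as raw.
    return "processed_logprobs" if low / len(flat) > PROCESSED_RATIO else "raw_logprobs"


def resolve_logprobs_mode(explicit, enforced_top_token_ids, deployment_default):
    """Priority chain: explicit request value > auto-detected > deployment default."""
    if explicit is not None:
        return explicit
    if enforced_top_token_ids is not None:
        detected = classify_logprobs_mode(enforced_top_token_ids)
        if detected is not None:
            return detected
    return deployment_default


# Hypothetical rows: half of the IDs are below 4, so this looks "processed".
padded_rows = [[0, 1, 2, 3], [10, 11, 12, 13]]
mode = resolve_logprobs_mode(None, padded_rows, "raw_logprobs")
```

Keeping detection separate from resolution means an explicit `logprobs_mode` on the request always wins without the classifier running at all.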

Updated vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha1
Updated MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha1

What's left: chat priority gating, grammar graceful degradation

@tamazgadaev

tamazgadaev commented Apr 9, 2026

UPD: The priority gating mechanism described below has a flaw: we should not drain in-flight inferences, we should abort them. The fix is in the next commit and is described in the next comment.

The last merge implements the chat priority gating and grammar graceful degradation mentioned above.

Summary

  • Adds chat priority gating for PoC: when PoC generation is active, the chat completion endpoint returns 503 (service_unavailable) to prevent GPU contention and NCCL deadlocks. The flag is set before the generation task is created and cleared via an add_done_callback on the task, covering normal completion, cancellation, and failure.
  • Drains in-flight inference before PoC GPU work: poc_request now waits for all unfinished requests to complete before issuing collective_rpc. This is required because execute_poc_forward reuses KV-cache blocks starting from block 0 — any overlap with live inference permanently corrupts model output. The drain is bounded by timeout_ms; on timeout, PoC is skipped with {"skipped": True}.
  • Removes attention metadata caching in poc_model_runner: the metadata builder's internal state (workspace buffers, page-table references) is mutated by every inference engine step, so reusing stale metadata caused the attention backend to write only a fraction of the expected KV entries, producing all-NaN hidden states. Rebuilding costs <1 ms vs ~15 ms for the model forward.
  • Adds grammar graceful degradation in xgrammar backend: when a grammar FSM rejects a token (e.g. enforced tokens from validation replay that conflict with the JSON schema), grammar enforcement is disabled for the remainder of that request instead of crashing. The _grammar_failed flag short-circuits accept_tokens, rollback, and fill_bitmask; it resets on reset().
  • Bakes deployment defaults into vLLM config: FlashInfer attention backend via --attention-backend, processed_logprobs as the default --logprobs-mode, max_num_batched_tokens=32768 for the OpenAI API server, and a Qwen3-MoE-specific custom_ops compilation preset. Also forces custom_ops=all unconditionally instead of branching on inductor/compilation mode.
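
The grammar graceful degradation bullet can be pictured with a toy wrapper. This sketch does not use the xgrammar API: `EvenOnlyMatcher` and `DegradingGrammar` are hypothetical stand-ins, and `fill_bitmask` returns a list of booleans instead of filling a tensor in place. Only the `_grammar_failed` flag and the four method names it affects are taken from the description.

```python
class EvenOnlyMatcher:
    """Toy stand-in for a grammar FSM: only even token IDs are legal."""

    def __init__(self):
        self.accepted = []

    def accept_token(self, token_id):
        if token_id % 2 != 0:
            return False
        self.accepted.append(token_id)
        return True

    def allowed_mask(self, vocab_size):
        return [t % 2 == 0 for t in range(vocab_size)]

    def rollback(self, n):
        if n > 0:  # guard: a slice starting at -0 would drop everything
            del self.accepted[-n:]

    def reset(self):
        self.accepted.clear()


class DegradingGrammar:
    """Disable grammar enforcement for the rest of a request after a rejection."""

    def __init__(self, matcher):
        self._matcher = matcher
        self._grammar_failed = False

    def accept_tokens(self, token_ids):
        if self._grammar_failed:
            return True  # enforcement is off: accept everything
        for token_id in token_ids:
            if not self._matcher.accept_token(token_id):
                self._grammar_failed = True  # degrade instead of raising
                break
        return True

    def rollback(self, n):
        if not self._grammar_failed:
            self._matcher.rollback(n)

    def fill_bitmask(self, vocab_size):
        if self._grammar_failed:
            return [True] * vocab_size  # no token is masked out
        return self._matcher.allowed_mask(vocab_size)

    def reset(self):
        self._grammar_failed = False
        self._matcher.reset()


grammar = DegradingGrammar(EvenOnlyMatcher())
grammar.accept_tokens([2, 4])
mask_before = grammar.fill_bitmask(4)
grammar.accept_tokens([3])  # an enforced token the toy grammar rejects
mask_after = grammar.fill_bitmask(4)
```

After the rejected token the mask allows every token, which is what lets a validation replay continue past a grammar conflict; `reset()` restores enforcement for the next request.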

Updated images

vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha2
MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha2

@tcharchian tcharchian moved this from Todo to In Progress in Upgrade v0.2.12 Apr 9, 2026
@tamazgadaev

Added tests/gonka with live and unit tests for several important scenarios. These tests do not affect vLLM runtime behavior.

@tamazgadaev

tamazgadaev commented Apr 9, 2026

PoC priority: abort in-flight inference instead of draining

Previously, when PoC started, poc_request waited for all in-flight inference to finish naturally (drain). This delayed PoC by however long the current generation took to complete.

Now poc_request aborts all in-flight requests immediately via self.abort() before running collective_rpc. Also added the 503 guard to the /v1/completions endpoint (was already on /v1/chat/completions).
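
The combined behaviour (503 guard on both endpoints, abort instead of drain, flag cleared by a done-callback) can be sketched with asyncio. This is a simplified model and not the code in routes or engine_patch.py: `PocGate` and its methods are hypothetical, and in the PR the abort happens inside poc_request before collective_rpc while the flag lives in the route layer; here both are folded into one object for brevity.

```python
import asyncio


class PocGate:
    """Toy model of PoC priority over inference."""

    def __init__(self):
        self.poc_generation_active = False
        self.in_flight = set()  # IDs of running inference requests
        self.aborted = []

    def handle_inference(self, request_id):
        """Status code for /v1/chat/completions or /v1/completions."""
        if self.poc_generation_active:
            return 503  # service unavailable while PoC owns the GPUs
        self.in_flight.add(request_id)
        return 200

    def abort_in_flight(self):
        # Abort immediately instead of waiting for generations to finish.
        self.aborted.extend(sorted(self.in_flight))
        self.in_flight.clear()

    def start_poc(self, poc_coro_fn):
        self.poc_generation_active = True  # set before the task is created
        self.abort_in_flight()
        task = asyncio.ensure_future(poc_coro_fn())
        # Runs on normal completion, cancellation, and failure alike.
        task.add_done_callback(self._clear)
        return task

    def _clear(self, _task):
        self.poc_generation_active = False


async def demo():
    gate = PocGate()
    before = gate.handle_inference("req-1")

    async def poc_work():
        await asyncio.sleep(0)  # placeholder for the PoC forward passes

    task = gate.start_poc(poc_work)
    during = gate.handle_inference("req-2")
    await task
    await asyncio.sleep(0)  # give the done-callback a chance to run
    after = gate.handle_inference("req-3")
    return before, during, after, gate.aborted


result = asyncio.run(demo())
```

Setting the flag before the task exists closes the window in which a request could slip in between the decision to start PoC and the first GPU call.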

New images

vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha3
MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha3

@tamazgadaev

vLLM: ghcr.io/product-science/vllm:v0.15.1-alpha5
MLnode: ghcr.io/product-science/mlnode:3.0.13-alpha5

Updates:

  • Merged feat: stronger seed #30, which strengthens the PoC seed
  • Set the default gpu_memory_utilization to 0.925 to fit on 8xH100
  • Hardcoded dtype='auto' into vLLM

@gmorgachev

Best result so far, with ghcr.io/product-science/mlnode:3.0.13-alpha5 on 8xH100:

export VLLM_ATTENTION_BACKEND=FLASHINFER
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export VLLM_ALLOW_INSECURE_SERIALIZATION=1
export POC_RPC_TIMEOUT_MS=300000
export POC_BATCH_SIZE_DEFAULT=16

args:

  "additional_args": [
    "--tensor-parallel-size", "4",
    "--max-num-batched-tokens", "65536",
    "--gpu-memory-utilization", "0.92",
    "--max-num-seqs", "128",
    "--max-model-len", "240000",
    "--enable-expert-parallel",
    "--disable-custom-all-reduce",
    "--num-gpu-blocks-override", "15000"
  ]

This yields around 1295 nonces/min from each 4xH100 group (2590 nonces/min total).
