
introduce adaptive stream interception and dynamic submit frequency for vLLM #24

Open

tonghaoxin wants to merge 1 commit into XpuOS:main from tonghaoxin:vllm_xsched

Conversation

@tonghaoxin

1. Motivation & Problem Statement

When deploying high-concurrency LLM inference engines like vLLM alongside other models using xsched, two critical bottlenecks emerge in the CUDA interception layer (shim.cpp):

  • Stream Escape (Interception Blind Spot): vLLM uses independent or hidden stream submission mechanisms that bypass the default interception logic. These streams are invisible to the scheduler, which therefore loses all control over the engine.
  • Overhead Inversion (GPU Starvation): vLLM submits a massive number of microsecond-scale kernels. The native submit/accounting mechanism (waking up a thread for every single kernel) introduces software scheduling overhead that far exceeds the actual GPU execution time, leading to severe performance degradation.

2. Proposed Solution (Modifications in shim.cpp)

This PR introduces two major enhancements to address the above issues:

  • Global Stream Takeover: Overhauled the registration logic in shim.cpp (specifically hooking AutoCreate) to forcefully capture and register all newly created streams into xqueue. This completely eliminates the vLLM monitoring blind spot and ensures 100% scheduler control (see the first sketch after this list).
  • Adaptive Submit Frequency: Implemented a dynamic, microsecond-level accounting mechanism (see the second sketch after this list):
    • Silent/Low-Priority Period: Under normal loads, the submit frequency is aggressively reduced (e.g., 1 submit per 500 kernels) to maximize throughput and eliminate software overhead.
    • Preemption Period: Upon detecting high-priority VIP tasks, it performs a rapid fallback to a 1/1 submit frequency for immediate preemption, ensuring strict SLA for online services.
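For illustration, here is a minimal sketch of the stream-takeover idea, using cuStreamCreate as the example entry point; the RegisterStreamWithXQueue() hook and the LD_PRELOAD-style symbol lookup are assumptions for this sketch, not the actual shim.cpp code:

```cpp
// Sketch: intercept stream creation so every new stream is registered
// with the scheduler. RegisterStreamWithXQueue() is hypothetical.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <cuda.h>
#include <dlfcn.h>

// Hypothetical hook: hands the new stream to the xqueue scheduler.
static void RegisterStreamWithXQueue(CUstream stream) {
    (void)stream;  // stub for the sketch
}

extern "C" CUresult cuStreamCreate(CUstream *stream, unsigned int flags) {
    // Forward to the real driver entry point.
    using RealFn = CUresult (*)(CUstream *, unsigned int);
    static RealFn real = (RealFn)dlsym(RTLD_NEXT, "cuStreamCreate");
    CUresult res = real(stream, flags);
    if (res == CUDA_SUCCESS) {
        // Register every newly created stream so no submission path
        // escapes the scheduler.
        RegisterStreamWithXQueue(*stream);
    }
    return res;
}
```

And a sketch of the adaptive accounting idea; kSilentInterval, g_preemption_pending, and NotifyScheduler() are likewise names invented for this sketch, not taken from the patch:

```cpp
// Sketch: adaptive submit frequency for the kernel-launch hook.
#include <atomic>

constexpr int kSilentInterval = 500;  // 1 submit per 500 kernels when idle
static std::atomic<bool> g_preemption_pending{false};  // set on VIP arrival
static thread_local int g_kernel_count = 0;

// Hypothetical hook: wakes the xsched accounting thread.
static void NotifyScheduler() {}

// Called by the shim after every intercepted kernel launch.
void OnKernelLaunched() {
    if (g_preemption_pending.load(std::memory_order_relaxed)) {
        // Preemption period: fall back to a 1/1 submit frequency so the
        // scheduler can act on the very next launch.
        NotifyScheduler();
        return;
    }
    // Silent period: batch accounting to amortize the cost of waking the
    // scheduler across many microsecond-scale kernels.
    if (++g_kernel_count >= kSilentInterval) {
        g_kernel_count = 0;
        NotifyScheduler();
    }
}
```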

3. Benchmark & Impact

We have validated this approach in a production Proof-of-Concept (POC) environment combining vLLM (Qwen3-8B) with PyTorch-based Embedding, CLIP, and Rerank models on a single L40S GPU:

  • Hardware Cost Reduction: Consolidated the workload of 4 independent GPUs onto a single L40S GPU.
  • Performance Metrics:
    • Offline task throughput degradation is kept under 5%.
    • High-priority online task (Rerank) latency increased by only ~15% under full offline load.

4. Checklist

  • I have read the CONTRIBUTING.md document.
  • I have tested these changes locally and verified the performance improvements.
  • My code follows the code style of this project.

@wuwen03
Contributor

wuwen03 commented Apr 6, 2026

Hi @tonghaoxin

Thanks a lot for the contribution and for sharing your motivation. We really appreciate the effort you put into this PR.

A few comments from our side:

About the stream takeover / interception blind spot part

We think your point about stream escaping is valid, and this part of the problem is worth addressing. Your fix makes sense, but we would like to make some modifications to this part of the implementation on our side to keep it more consistent with the existing code style and overall structure of the project.

About the Adaptive Submit Frequency part

We are curious whether you have tried the Highest Priority First (HPF) policy. In theory, it should already help address the issue you described.

More generally, we feel the current adaptive submit mechanism is somewhat too specific to this particular workload pattern / deployment scenario, and may not fully align with the general design philosophy of XSched. We prefer to keep the core scheduling logic as generic as possible unless there is strong evidence that such specialization should be part of the mainline project.

If you have a different view on this, we are very open to continuing the discussion.

BTW, for easier open-source collaboration and long-term maintenance, we would appreciate it if code comments could be written in English.

Thanks again for the contribution! We look forward to further discussion in this PR.

@wuwen03
Contributor

wuwen03 commented Apr 23, 2026

Hi @tonghaoxin

We have prepared a batch of patches to address the first issue you mentioned. However, during our testing, we were unable to find evidence that vLLM uses per-thread default stream mode. Could you share more information about the vLLM version, environment, or any other relevant details to help us investigate further?
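For context, per-thread default stream mode is normally an explicit opt-in on the application side, e.g.:

```cpp
// Per-thread default stream mode is enabled either by compiling with
//   nvcc --default-stream per-thread file.cu
// or by defining this macro before including the CUDA runtime header:
#define CUDA_API_PER_THREAD_DEFAULT_STREAM
#include <cuda_runtime.h>

// With the flag/macro set, work launched on the "default" stream targets
// the per-thread handle (cudaStreamPerThread) rather than the global
// legacy default stream (cudaStreamLegacy).
```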

