introduce adaptive stream interception and dynamic submit frequency for vLLM #24

tonghaoxin wants to merge 1 commit into
Conversation
Hi @tonghaoxin Thanks a lot for the contribution and for sharing your motivation. We really appreciate the effort you put into this PR. A few comments from our side:

About the stream takeover / interception blind spot part: We think your point about stream escaping is valid, and this part of the problem is worth addressing. Your fix makes sense, but we would like to make some modifications to this part of the implementation on our side to keep it more consistent with the existing code style and overall structure of the project.

About the Adaptive Submit Frequency part: We are curious whether you have tried the Highest Priority First (HPF) policy. In theory, it should already help address the issue you described. More generally, we feel the current adaptive submit mechanism is somewhat too specific to this particular workload pattern / deployment scenario, and may not fully align with the general design philosophy of XSched. We prefer to keep the core scheduling logic as generic as possible unless there is strong evidence that such specialization should be part of the mainline project. If you have a different view on this, we are very open to continuing the discussion.

BTW, for easier open-source collaboration and long-term maintenance, we would appreciate it if code comments could be written in English.

Thanks again for the contribution! We look forward to further discussion in this PR.
Hi @tonghaoxin We have prepared a batch of patches to address the first issue you mentioned. However, during our testing, we were unable to find evidence that vLLM is using per-thread default stream mode. Could you share more information about the vLLM version, environment, or any other relevant details to help us investigate further?
1. Motivation & Problem Statement
When deploying high-concurrency LLM inference engines such as vLLM alongside other models using xsched, two critical bottlenecks emerge in the CUDA interception layer (shim.cpp):

2. Proposed Solution (Modifications in shim.cpp)

This PR introduces two major enhancements to address the above issues:

- Extends shim.cpp (specifically, hooking AutoCreate) to forcefully capture and register all newly created streams into xqueue. This completely eliminates the vLLM monitoring blind spot and ensures 100% scheduler control.

3. Benchmark & Impact
We have validated this approach in a production Proof-of-Concept (POC) environment combining vLLM (Qwen3-8B) with PyTorch-based Embedding, CLIP, and Rerank models on a single L40S GPU:
4. Checklist