fix(rollout): Resume vLLM workers before draining in abort#294
Open
Josephasafg wants to merge 3 commits into
Open
fix(rollout): Resume vLLM workers before draining in abort#294Josephasafg wants to merge 3 commits into
Josephasafg wants to merge 3 commits into
Conversation
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Signed-off-by: Josephasafg <ajgard7@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request modifies the abort logic in vime/rollout/vllm_rollout.py to resume workers before draining pending tasks, which prevents potential hangs from raced requests. It also introduces a helper function _broadcast_to_workers and adds corresponding unit tests. The reviewer suggested utilizing the newly introduced _broadcast_to_workers helper function within the abort function to reduce code duplication for the pause and resume operations.
Signed-off-by: Josephasafg <ajgard7@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
abort()ordered the worker calls as pause → drain → resume, which can hang the rollout under--partial-rolloutwith continuously-submitting (e.g. multi-turn) rollouts.The call to
/pause?mode=abortaborts the known requests but leaves the engine inPAUSED_NEWstatus, so a/generatePOST that races in after the pause ends up in the waiting queue and never returns. One such request blocks thewhile state.pendingsdrain forever - and/resume, the only thing that frees it, sat after the drain.The fix here is to reorder to pause → resume → drain so the parked requests get scheduled and return, letting the drain complete. Tasks still queued on the semaphore short-circuit on
state.aborted, so nothing new is generated.