[Feature] TransferQueue Integration for Rollout-to-Training by miracle0517 · Pull Request #242 · vllm-project/vime

miracle0517 · 2026-06-12T13:47:32Z

✨ Summary

Introduce an optional VIME TransferQueue data path for transferring rollout data to training.
When enabled via --enable-vime-transfer-queue, rollout batches are written directly into the TransferQueue. Megatron actor/critic workers then fetch their DP-local training data straight from TQ, bypassing the previous Ray ObjectRef rollout payload path.
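As a rough illustration of how an opt-in flag like this can select between the two data paths, the sketch below parses `--enable-vime-transfer-queue` and returns a label for the chosen path. Only the flag name comes from this PR. Treating it as a boolean `store_true` switch, the helper `select_rollout_data_path`, and the two string labels are assumptions made for the example, not the PR's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser that exposes only the flag named in the PR description."""
    parser = argparse.ArgumentParser(description="toy launcher (illustrative only)")
    parser.add_argument(
        "--enable-vime-transfer-queue",
        action="store_true",  # assumption: the flag is a plain on/off switch
        help="Send rollout data to training through TransferQueue.",
    )
    return parser


def select_rollout_data_path(args: argparse.Namespace) -> str:
    """Return a label for the data path (hypothetical helper and labels)."""
    # argparse turns the dashes in the flag name into underscores.
    if args.enable_vime_transfer_queue:
        return "transfer_queue"
    return "ray_object_ref"


default_args = build_parser().parse_args([])
tq_args = build_parser().parse_args(["--enable-vime-transfer-queue"])
print(select_rollout_data_path(default_args))
print(select_rollout_data_path(tq_args))
```

Keeping the switch off by default matches the description of the TransferQueue path as optional: existing runs keep the Ray ObjectRef path unless the flag is passed.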

🔧 What’s Changed

- TransferQueueBridge – A new central component that handles:
  - TQ initialization and client connections
  - Field conversion
  - Partition cleanup
  - Unified data fetch/write APIs
- Rollout Workers now publish normalized rollout batches directly into the TQ.
- Megatron Actor Workers now fetch rollout data from the TQ instead of relying on Ray ObjectRefs.
- Local Training Schedule Preservation – After fetching from TQ, the following fields are kept intact:
  - global_batch_sizes
  - num_microbatches
  - micro_batch_indices
- Backpressure & Cleanup – TQ staleness backpressure is enforced, and explicit partition cleanup runs once consumers have finished.
- Extended Field Support – Both tensor and non-tensor rollout fields are supported, including multimodal inputs, metadata, prompts, routing replay data, and extra configured fields.
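To make the "Backpressure & Cleanup" item more concrete, here is a toy, in-memory model of the two ideas: a producer that is held back once it runs too far ahead of training, and a partition that is deleted only after every consumer has finished with it. The class `StalenessGate`, its fields, and the exact staleness rule are hypothetical and chosen for illustration. The PR does not show how TransferQueue implements either mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class StalenessGate:
    """Toy model of staleness backpressure plus partition cleanup (hypothetical)."""

    max_staleness: int  # how many rollouts the producer may run ahead
    num_consumers: int  # consumers that must finish before a partition is dropped
    partitions: dict = field(default_factory=dict)  # rollout_id -> payload
    acks: dict = field(default_factory=dict)  # rollout_id -> set of consumer ids
    last_trained: int = -1  # newest rollout_id fully consumed so far

    def can_write(self, rollout_id: int) -> bool:
        # Distance between this rollout and the next one training will consume.
        return rollout_id - self.last_trained - 1 <= self.max_staleness

    def write(self, rollout_id: int, payload) -> bool:
        """Store a rollout partition, or refuse if the producer is too far ahead."""
        if not self.can_write(rollout_id):
            return False
        self.partitions[rollout_id] = payload
        self.acks[rollout_id] = set()
        return True

    def mark_consumed(self, rollout_id: int, consumer_id: str) -> None:
        """Record that one consumer is done; clean up once all of them are."""
        self.acks[rollout_id].add(consumer_id)
        if len(self.acks[rollout_id]) == self.num_consumers:
            del self.partitions[rollout_id]
            del self.acks[rollout_id]
            self.last_trained = max(self.last_trained, rollout_id)


gate = StalenessGate(max_staleness=1, num_consumers=2)
gate.write(0, "batch-0")
gate.write(1, "batch-1")
blocked = not gate.write(2, "batch-2")  # producer is too far ahead here
gate.mark_consumed(0, "actor")
gate.mark_consumed(0, "critic")  # last consumer: partition 0 is removed
unblocked = gate.write(2, "batch-2")  # the gate opens again
```

Tying cleanup to consumer acknowledgements rather than to a timer means a slow consumer never finds its partition already gone, at the cost of holding memory until the slowest consumer finishes.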

read-the-docs-community · 2026-06-12T13:48:37Z

Documentation build overview

📚 vime | 🛠️ Build #33162414 | 📁 Comparing 4a6fbd0 against latest (fa0b6e9)

🔍 Preview build

26 files changed · ± 26 modified


gemini-code-assist

Code Review

This pull request introduces TransferQueueBridge as an optional, high-performance rollout-to-training data plane (TransferQueue) to replace the existing Ray ObjectRef path. It integrates this bridge across the training loop, rollout manager, and Megatron actor/critic backends, while adding corresponding CLI arguments and validation. Feedback on these changes highlights two key areas for improvement: first, a busy-wait loop in _get_rollout_data_from_transfer_queue should include a small sleep to prevent high CPU utilization and network congestion during collective broadcasts; second, dict_to_tensordict should explicitly handle numpy.ndarray elements to avoid slow fallback serialization via NonTensorData.

gemini-code-assist · 2026-06-12T13:49:29Z

+        while rollout_data is None:
+            rollout_data, batch_meta = self.transfer_queue.get_data(
+                rollout_id,
+                task_name=task_name,
+                data_fields=data_fields,
+            )


The while rollout_data is None: loop busy-waits infinitely when the TransferQueue does not have data ready yet. Because get_data performs collective broadcasts (_broadcast_payload) across all ranks in the model parallel group, this tight loop will cause all ranks to continuously execute collective communications (dist.broadcast_object_list). This will peg 100% CPU on all training processes, cause severe network congestion, and potentially lead to desynchronization or NCCL timeouts.

Adding a small sleep (e.g., time.sleep(0.1)) inside the loop when rollout_data is None will gracefully poll the queue and prevent CPU/network saturation.

        while rollout_data is None:
            rollout_data, batch_meta = self.transfer_queue.get_data(
                rollout_id,
                task_name=task_name,
                data_fields=data_fields,
            )
            if rollout_data is None:
                import time
                time.sleep(0.1)
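The suggestion above can be generalized into a small polling helper with a configurable interval and an optional timeout. The sketch below is standalone and uses a stand-in `fetch` callable instead of `self.transfer_queue.get_data`. The helper name `poll_until_ready`, its parameters, and the timeout behavior are assumptions for illustration and are not part of the PR. One caveat for the real code path: because get_data is described as a collective call, every rank has to take the same branch on each iteration, so a timeout decided from a rank-local clock as shown here would need to be agreed on across ranks before any rank exits the loop.

```python
import time
from typing import Any, Callable, Optional, Tuple


def poll_until_ready(
    fetch: Callable[[], Tuple[Optional[Any], Optional[Any]]],
    interval_s: float = 0.1,
    timeout_s: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Any, Any]:
    """Call fetch() until it returns data, sleeping between empty attempts."""
    deadline = None if timeout_s is None else clock() + timeout_s
    while True:
        data, meta = fetch()
        if data is not None:
            return data, meta
        # Give up instead of spinning forever when a timeout was requested.
        if deadline is not None and clock() >= deadline:
            raise TimeoutError("rollout data did not become available in time")
        # Yield the CPU (and, in the real setting, the network) between polls.
        sleep(interval_s)


# Usage with a fake queue that has data ready on the third attempt.
attempts = {"count": 0}


def fake_get_data():
    attempts["count"] += 1
    if attempts["count"] < 3:
        return None, None
    return {"tokens": [1, 2, 3]}, {"rollout_id": 0}


rollout_data, batch_meta = poll_until_ready(fake_get_data, interval_s=0.01)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to test without real delays.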

gemini-code-assist · 2026-06-12T13:49:29Z

+            if value and isinstance(value[0], torch.Tensor):
+                tensors = []
+                for item in value:
+                    tensor = item.detach()
+                    if tensor.device.type != "cpu":
+                        tensor = tensor.cpu()
+                    if device is not None:
+                        tensor = tensor.to(device)
+                    tensors.append(tensor)
+                if tensors[0].ndim == 0:
+                    result[key] = torch.stack(tensors)
+                else:
+                    result[key] = torch.nested.as_nested_tensor(tensors, layout=torch.jagged)
+                continue


In dict_to_tensordict, list elements that are numpy.ndarray are not handled by the isinstance(value[0], torch.Tensor) check. Consequently, they fall back to the nesting_depth checks. If the numpy arrays have different shapes (e.g., variable sequence lengths), tensor_1d or tensor_2d will raise a ValueError, causing the field to silently fall back to non_tensor_stack (which wraps them in NonTensorData).

NonTensorData uses standard Python pickling, which is extremely slow, lacks zero-copy serialization, and cannot be easily moved to GPU. Supporting numpy.ndarray directly in the tensor check by converting them to torch.Tensor using torch.from_numpy will keep them as highly efficient nested/jagged tensors.

            if value and isinstance(value[0], (torch.Tensor, np.ndarray)):
                tensors = []
                for item in value:
                    if isinstance(item, np.ndarray):
                        tensor = torch.from_numpy(item)
                    else:
                        tensor = item.detach()
                    if tensor.device.type != "cpu":
                        tensor = tensor.cpu()
                    if device is not None:
                        tensor = tensor.to(device)
                    tensors.append(tensor)
                if tensors[0].ndim == 0:
                    result[key] = torch.stack(tensors)
                else:
                    result[key] = torch.nested.as_nested_tensor(tensors, layout=torch.jagged)
                continue

aoshen02 · 2026-06-14T01:31:23Z

Hi, thanks for the contribution, is there any experiment results?

miracle0517 · 2026-06-15T01:17:13Z

> Hi, thanks for the contribution, is there any experiment results?

The test results are here: #244

aoshen02 · 2026-06-16T13:35:47Z

This scale is not big enough, could you test with bigger vl models?

miracle0517 · 2026-06-18T03:58:07Z

> This scale is not big enough, could you test with bigger vl models?

We will conduct follow-up tests on multi-machine scenarios with the vl model to obtain performance data. Thank you.

aoshen02 · 2026-06-18T08:18:15Z

> > This scale is not big enough, could you test with bigger vl models?
>
> We will conduct follow-up tests on multi-machine scenarios with the vl model to obtain performance data. Thank you.

Cool, thank you.

miracle0517 added 2 commits June 12, 2026 21:29

Support TransferQueue as external data buffer

7e19434

fix bug

1e9cfb6

gemini-code-assist Bot reviewed Jun 12, 2026

add tq test script

63e5d80

miracle0517 force-pushed the feature/transferqueue branch from acfb259 to 63e5d80 · June 13, 2026 03:09

Fulin-Gao mentioned this pull request Jun 13, 2026

[Ascend][RFC] vime-ascend Build and Roadmap #243

Open

18 tasks

miracle0517 mentioned this pull request Jun 13, 2026

[RFC] Integrating TransferQueue into Vime: Decoupling Rollout and Training with a Streaming Data Bus #227

Open

8 tasks

CalvinXKY mentioned this pull request Jun 13, 2026

[Spike] [TQ] Transfer Queue test on H800 (Qwen3-4B GRPO) #244

Open

add requirement && fix tq import

4a6fbd0

miracle0517 wants to merge 4 commits into vllm-project:feature/transferqueue from miracle0517:feature/transferqueue
