A vLLM-based implementation of Decoding Tree Sketching, an entropy-guided tree search decoding framework that improves reasoning accuracy of language models.
This repository migrates DTS from the original HuggingFace Transformers backend to vLLM, enabling efficient inference with PagedAttention and automatic prefix caching. It also supports multi-GPU acceleration via tensor parallelism.
git clone https://github.com/meierteng/DTS-vLLM.git
cd DTS-vLLM
pip install -e .

Requires Python >= 3.10, PyTorch >= 2.4, vLLM >= 0.8.
DTS monitors Shannon entropy (H) and varentropy (V) during generation. When the model is confident overall but split between a few alternatives (low H, high V), DTS branches the decoding tree.
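As a rough illustration of the branching signal, the sketch below computes entropy and varentropy for a single next-token distribution and applies the two thresholds. The function names, the use of natural logarithms (nats), and the plain-list input are assumptions made for this example; the repository's actual implementation may differ.

```python
import math

def entropy_and_varentropy(probs):
    """Return (H, V) in nats for a discrete next-token distribution."""
    # Zero-probability tokens contribute nothing to either sum, so skip them.
    terms = [(p, -math.log(p)) for p in probs if p > 0.0]
    h = sum(p * s for p, s in terms)             # H = E[-log p]
    v = sum(p * (s - h) ** 2 for p, s in terms)  # V = Var[-log p]
    return h, v

def should_branch(probs, entropy_threshold=2.5, varentropy_threshold=1.5):
    """Hypothetical helper: branch when H <= threshold and V > threshold."""
    h, v = entropy_and_varentropy(probs)
    return h <= entropy_threshold and v > varentropy_threshold

# One dominant token plus a long tail: low entropy, high varentropy.
peaked_with_tail = [0.8] + [0.2 / 1000] * 1000
# A uniform distribution has zero varentropy, so it never triggers a branch.
uniform = [0.25] * 4
print(should_branch(peaked_with_tail), should_branch(uniform))
```

The default thresholds mirror the `-e` and `-v` defaults listed in the parameter table.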
Phase 1 (Branching): Generate one token at a time, branch when the condition is met, until max_active_hyps hypotheses are created.
Phase 2 (Completion): Complete all hypotheses in a single batched generate() call. This phase accounts for more than 99% of wall time.
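The Phase 1 loop can be pictured with the toy sketch below. It is not the repository's code: `sketch_tree`, `next_token_dist`, and `should_branch` are hypothetical names, non-branching steps simply take the most likely token (the real system may sample at the configured temperature), and Phase 2 is left out because it amounts to a single batched call on the returned hypotheses.

```python
def sketch_tree(prefix, next_token_dist, should_branch,
                top_k=3, max_active_hyps=48, max_steps=64):
    """Phase 1 only: grow hypotheses one token at a time until the cap is hit."""
    hyps = [list(prefix)]
    for _ in range(max_steps):
        if len(hyps) >= max_active_hyps:
            break  # freeze: the hypotheses are handed over to Phase 2
        grown = []
        for i, hyp in enumerate(hyps):
            # next_token_dist is an assumed callable returning (token, prob) pairs.
            ranked = sorted(next_token_dist(hyp), key=lambda tp: tp[1], reverse=True)
            probs = [p for _, p in ranked]
            # Leave one slot for every hypothesis not yet extended in this step.
            remaining = len(hyps) - i - 1
            budget = max_active_hyps - len(grown) - remaining
            width = min(top_k, budget) if should_branch(probs) else 1
            grown.extend(hyp + [tok] for tok, _ in ranked[:width])
        hyps = grown
    return hyps

# Toy model: the same three-way distribution at every step, always branching.
toy_dist = lambda hyp: [("a", 0.5), ("b", 0.3), ("c", 0.2)]
frozen = sketch_tree([], toy_dist, lambda probs: True, top_k=3, max_active_hyps=5)
print(len(frozen))
```

The budget bookkeeping keeps the number of hypotheses from exceeding `max_active_hyps` even when several hypotheses want to branch in the same step.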
Two decoding modes:
- Greedy (--num_traces 1): Return the first trace to finish.
- Stable (--num_traces 8): Majority vote across multiple traces.
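For the Stable mode, a majority vote over the final answers of the completed traces might look like the sketch below. The `majority_vote` helper, the use of `None` for traces without a parsed answer, and the tie-breaking rule (first-seen answer wins) are assumptions for illustration, not the repository's confirmed behavior.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer, ignoring traces that produced none."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None  # no trace produced a usable answer
    # most_common breaks ties by first occurrence among the counted answers.
    return counts.most_common(1)[0][0]

trace_answers = ["42", "17", "42", None, "42", "17", "108", "42"]
print(majority_vote(trace_answers))
```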
| Model Key | Model |
|---|---|
| 1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| 7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| phi-4-mini-reasoning | microsoft/Phi-4-mini-reasoning |
| qwen30p6 | Qwen/Qwen3-0.6B |

| Dataset Key | Dataset |
|---|---|
| aime24 | AIME 2024 (30 problems) |
| aime25 | AIME 2025 (30 problems) |
| gpqa_diamond | GPQA Diamond |
| livebench_reasoning | LiveBench Reasoning |

# DTS-Greedy
python run_benchmark.py dts \
--model_name 1.5B --dataset_name aime24 \
-e 2.5 -k 3 -a 48 -m 32768 -t 0.6 \
-s 0 -n 5 --num_traces 1 --enforce_eager
# DTS-Stable (majority vote)
python run_benchmark.py dts \
--model_name 1.5B --dataset_name aime24 \
-e 2.5 -k 3 -a 48 -m 32768 -t 0.6 \
-s 0 -n 5 --num_traces 8 --enforce_eager
# Standard autoregressive baseline
python run_benchmark.py standard \
--model_name 1.5B --dataset_name aime24 \
-t 0.6 -s 0 -n 5

python run_benchmark.py dts \
--model_name 7B --dataset_name aime25 -t 0.5 \
-e 2.5 -k 3 -a 48 -m 32768 -s 0 -n 1 --num_traces 1 --enforce_eager \
--tensor_parallel_size 2

For multi-GPU runs, set these environment variables:
export VLLM_SKIP_P2P_CHECK=1
export NCCL_IGNORE_DISABLED_P2P=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export VLLM_WORKER_MULTIPROC_METHOD=spawn

| Parameter | Flag | Default | Description |
|---|---|---|---|
| Entropy threshold | -e | 2.5 | Branch when H <= this |
| Varentropy threshold | -v | 1.5 | Branch when V > this |
| Branch top-k | -k | 3 | Children per branch point |
| Max active hypotheses | -a | 48 | Freeze threshold |
| Max new tokens | -m | 32768 | Per-hypothesis token limit |
| Temperature | -t | 0.6 | Sampling temperature |
| Num traces | --num_traces | 8 | 1 for Greedy, 8 for Stable |
| Tensor parallel size | --tensor_parallel_size | 1 | Number of GPUs |
| Trials | -n | 1 | Number of seeds to run |
| Seed | -s | 0 | Starting random seed |

All experiments: DTS-Greedy (num_traces=1), A100-SXM4-80GB, vLLM 0.8.5.post1, enforce_eager=True, 5 seeds (0-4).
| Method | Our Accuracy | Paper |
|---|---|---|
| Standard | 28.67% | 26.67% |
| DTS-Greedy (1 trace) | 53.33% | 54.67% |
| DTS-Stable (8 traces) | 67.33% | 64.67% |
| Method | Our Accuracy | Paper |
|---|---|---|
| DTS-Greedy (1 trace) | 30.00% | 34.67% |
| Config | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Mean Acc | Mean Time |
|---|---|---|---|---|---|---|---|
| TP=2 (2×A100) | 50.00% | 50.00% | 53.33% | 53.33% | 50.00% | 51.33% | 12920s (3.6h) |
| TP=1 (1×A100) | 60.00% | 56.67% | 46.67% | 56.67% | 46.67% | 53.33% | 25369s (7.0h) |
TP=2 achieves ~1.96x speedup over TP=1 with comparable accuracy (within variance).
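The speedup figure follows directly from the mean wall times in the table above. The short calculation below reproduces it; the parallel-efficiency line is an extra derived number (speedup divided by GPU count), not something reported by the benchmark itself.

```python
tp1_seconds = 25369  # mean time, TP=1 (1×A100)
tp2_seconds = 12920  # mean time, TP=2 (2×A100)

speedup = tp1_seconds / tp2_seconds
efficiency = speedup / 2  # fraction of ideal linear scaling on 2 GPUs

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
# prints "speedup: 1.96x, parallel efficiency: 98%"
```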
TP requires num_attention_heads % TP == 0 and num_kv_heads % TP == 0.
| Model | TP=2 | TP=4 |
|---|---|---|
| 1.5B (12 heads, 2 kv) | Yes | No |
| 7B (28 heads, 4 kv) | Yes | Yes |
| Qwen3-0.6B (16 heads, 8 kv) | Yes | Yes |
| Phi-4-mini (24 heads, 8 kv) | Yes | Yes |
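The divisibility rule stated above can be checked for any head configuration with a few lines of Python. The helper name and the candidate TP sizes are assumptions for this sketch; the head counts in the usage lines come from the 1.5B and 7B rows of the table.

```python
def supported_tp_sizes(num_attention_heads, num_kv_heads, candidates=(1, 2, 4, 8)):
    """Return the TP sizes that evenly divide both head counts."""
    return [
        tp for tp in candidates
        if num_attention_heads % tp == 0 and num_kv_heads % tp == 0
    ]

print(supported_tp_sizes(12, 2))  # 1.5B row
print(supported_tp_sizes(28, 4))  # 7B row
```

For the 1.5B model, TP=4 is rejected because 2 KV heads cannot be split evenly across 4 GPUs, which matches the "No" entry in the table.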
All experiment logs are preserved in logs/.
@article{xu2025dts,
title={DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching},
author={Xu, Zicheng and Wang, Guanchu and Chuang, Yu-Neng and Zheng, Guangyao and Szalay, Alexander S and Liu, Zirui and Braverman, Vladimir},
journal={arXiv preprint arXiv:2511.00640},
year={2025}
}

MIT License. See LICENSE.