DTS-vLLM

A vLLM-based implementation of Decoding Tree Sketching (DTS), an entropy-guided tree-search decoding framework that improves the reasoning accuracy of language models.

This repository ports DTS from the original HuggingFace Transformers backend to vLLM, which enables efficient inference with PagedAttention and automatic prefix caching. It also supports multi-GPU inference via tensor parallelism.

Installation

git clone https://github.com/meierteng/DTS-vLLM.git
cd DTS-vLLM
pip install -e .

Requires Python >= 3.10, PyTorch >= 2.4, and vLLM >= 0.8.

How DTS Works

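DTS monitors the Shannon entropy (H) and the varentropy (V, the variance of the token surprisal) of the next-token distribution during generation. When the model is confident overall but split between a few alternatives (low H, high V), that is, when H is at or below the entropy threshold and V is above the varentropy threshold, DTS branches the decoding tree.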
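The branching test can be made concrete with a small calculation. The following is a minimal sketch, not the repository's code: it assumes H and V are measured in nats on the softmax of the next-token logits, that V is the variance of the surprisal -log p, and that both thresholds must hold at once. The function names are illustrative.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_and_varentropy(probs):
    """Return (H, V) in nats for one next-token distribution.

    H = sum_i p_i * (-log p_i)            Shannon entropy
    V = sum_i p_i * (-log p_i - H) ** 2   variance of the surprisal
    """
    # Zero-probability tokens contribute nothing and would break log().
    pairs = [(-math.log(p), p) for p in probs if p > 0.0]
    h = sum(p * s for s, p in pairs)
    v = sum(p * (s - h) ** 2 for s, p in pairs)
    return h, v


def should_branch(probs, entropy_threshold=2.5, varentropy_threshold=1.5):
    """Assumed rule: branch when H <= entropy threshold and V > varentropy threshold."""
    h, v = entropy_and_varentropy(probs)
    return h <= entropy_threshold and v > varentropy_threshold
```

Under this rule a uniform distribution never triggers a branch, because every token has the same surprisal and the varentropy is zero.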

Phase 1 (Branching): Generate one token at a time and branch whenever the condition is met, until max_active_hyps hypotheses exist.

Phase 2 (Completion): Complete all hypotheses in a single batched generate() call. This phase accounts for more than 99% of wall-clock time.
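The two phases can be pictured with a toy loop. This is only a sketch under several assumptions that the text above does not confirm: all hypotheses advance in lockstep, non-branching steps extend with the single most likely token (real runs sample at a temperature), and end-of-sequence handling is ignored. `next_token_probs` and `should_branch` are hypothetical callables supplied by the caller, and the returned list is what Phase 2 would complete in one batch.

```python
def top_k(probs, k):
    """Return the k most probable (token, prob) pairs from a {token: prob} dict."""
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


def branching_phase(prompt, next_token_probs, should_branch,
                    branch_top_k=3, max_active_hyps=48, max_steps=64):
    """Phase 1 sketch: grow hypotheses token by token until the cap is reached."""
    hyps = [list(prompt)]
    for _ in range(max_steps):
        if len(hyps) >= max_active_hyps:
            break  # cap reached: hand the hypotheses to Phase 2
        next_hyps = []
        for i, tokens in enumerate(hyps):
            probs = next_token_probs(tokens)
            # Hypotheses not yet processed in this step still count
            # toward the cap, one continuation each.
            waiting = len(hyps) - i - 1
            room = max_active_hyps - (len(next_hyps) + waiting + 1)
            if should_branch(probs) and room > 0:
                # Never create more children than the cap allows.
                children = top_k(probs, min(branch_top_k, room + 1))
            else:
                children = top_k(probs, 1)  # extend with the most likely token
            next_hyps.extend(tokens + [tok] for tok, _ in children)
        hyps = next_hyps
    return hyps
```

With a condition that always fires, `branch_top_k=3`, and `max_active_hyps=5`, the loop yields three hypotheses after the first step and exactly five after the second, then stops.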

Two decoding modes:

Greedy (--num_traces 1): Return the first trace to finish.
Stable (--num_traces 8): Return the majority-vote answer across multiple traces.
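Majority voting in Stable mode can be sketched as follows. This assumes each finished trace has already been reduced to a final-answer string (answer extraction is not shown), that traces without an answer are skipped, and that ties go to the answer seen first; the repository may resolve these cases differently.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common non-empty answer, or None if no trace answered.

    Ties are broken in favor of the answer that appeared first, which is
    an assumption of this sketch.
    """
    valid = [a for a in answers if a is not None and a != ""]
    if not valid:
        return None
    # most_common() orders equal counts by first appearance.
    return Counter(valid).most_common(1)[0][0]
```

For example, eight traces that produce `"42"` four times, `"17"` twice, and no answer twice would return `"42"`.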

Supported Models and Datasets

Model Key	Model
`1.5B`	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
`7B`	deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
`phi-4-mini-reasoning`	microsoft/Phi-4-mini-reasoning
`qwen30p6`	Qwen/Qwen3-0.6B

Dataset Key	Dataset
`aime24`	AIME 2024 (30 problems)
`aime25`	AIME 2025 (30 problems)
`gpqa_diamond`	GPQA Diamond
`livebench_reasoning`	LiveBench Reasoning

Usage

Single GPU

# DTS-Greedy
python run_benchmark.py dts \
    --model_name 1.5B --dataset_name aime24 \
    -e 2.5 -k 3 -a 48 -m 32768 -t 0.6 \
    -s 0 -n 5 --num_traces 1 --enforce_eager

# DTS-Stable (majority vote)
python run_benchmark.py dts \
    --model_name 1.5B --dataset_name aime24 \
    -e 2.5 -k 3 -a 48 -m 32768 -t 0.6 \
    -s 0 -n 5 --num_traces 8 --enforce_eager

# Standard autoregressive baseline
python run_benchmark.py standard \
    --model_name 1.5B --dataset_name aime24 \
    -t 0.6 -s 0 -n 5

Multi-GPU (Tensor Parallelism)

python run_benchmark.py dts \
    --model_name 7B --dataset_name aime25 -t 0.5 \
    -e 2.5 -k 3 -a 48 -m 32768 -s 0 -n 1 --num_traces 1 --enforce_eager \
    --tensor_parallel_size 2

For multi-GPU runs, set these environment variables before launching:

export VLLM_SKIP_P2P_CHECK=1
export NCCL_IGNORE_DISABLED_P2P=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export VLLM_WORKER_MULTIPROC_METHOD=spawn

Parameters

Parameter	Flag	Default	Description
Entropy threshold	`-e`	2.5	Branch only when H <= this value
Varentropy threshold	`-v`	1.5	Branch only when V > this value
Branch top-k	`-k`	3	Children created per branch point
Max active hypotheses	`-a`	48	Stop branching once this many hypotheses exist
Max new tokens	`-m`	32768	Per-hypothesis token limit
Temperature	`-t`	0.6	Sampling temperature
Num traces	`--num_traces`	8	1 for Greedy, 8 for Stable
Tensor parallel size	`--tensor_parallel_size`	1	Number of GPUs
Trials	`-n`	1	Number of seeds to run
Seed	`-s`	0	Starting random seed

Results

Setup for all experiments: A100-SXM4-80GB, vLLM 0.8.5.post1, enforce_eager=True, 5 seeds (0-4). DTS runs use DTS-Greedy (num_traces=1) unless a row says otherwise.

DeepSeek-R1-Distill-Qwen-1.5B — AIME 2024 (temp=0.6)

Method	Accuracy (this repo)	Accuracy (paper)
Standard	28.67%	26.67%
DTS-Greedy (1 trace)	53.33%	54.67%
DTS-Stable (8 traces)	67.33%	64.67%

DeepSeek-R1-Distill-Qwen-1.5B — AIME 2025 (temp=0.5)

Method	Accuracy (this repo)	Accuracy (paper)
DTS-Greedy (1 trace)	30.00%	34.67%

DeepSeek-R1-Distill-Qwen-7B — AIME 2025 (temp=0.5): TP=1 vs TP=2

Config	Seed 0	Seed 1	Seed 2	Seed 3	Seed 4	Mean Acc	Mean Time
TP=2 (2×A100)	50.00%	50.00%	53.33%	53.33%	50.00%	51.33%	12920s (3.6h)
TP=1 (1×A100)	60.00%	56.67%	46.67%	56.67%	46.67%	53.33%	25369s (7.0h)

TP=2 is about 1.96x faster than TP=1 (12920s vs. 25369s mean time) with comparable accuracy: the 2-point gap in mean accuracy is smaller than the seed-to-seed spread.
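The summary figures can be rechecked from the table. In the sketch below, the per-seed counts of correct answers are recovered from the reported percentages and the 30-problem size of AIME 2025 (for example, 53.33% of 30 is 16); they are a reconstruction, not values read from the logs.

```python
from statistics import mean, stdev

PROBLEMS = 30  # AIME 2025 problem count

# Correct answers per seed, reconstructed from the table's percentages.
correct = {
    "TP=2": [15, 15, 16, 16, 15],
    "TP=1": [18, 17, 14, 17, 14],
}
mean_time_s = {"TP=2": 12920, "TP=1": 25369}


def accuracy_percent(counts):
    """Per-seed accuracy in percent."""
    return [100.0 * c / PROBLEMS for c in counts]


# Mean and sample standard deviation of accuracy across the five seeds.
summary = {
    name: (mean(accuracy_percent(c)), stdev(accuracy_percent(c)))
    for name, c in correct.items()
}
speedup = mean_time_s["TP=1"] / mean_time_s["TP=2"]

for name, (avg, sd) in summary.items():
    print(f"{name}: mean {avg:.2f}%, sample std {sd:.2f} points")
print(f"speedup: {speedup:.2f}x")
```

The sample standard deviation of the TP=1 runs comes out at roughly 6 points, about three times the 2-point difference between the two means.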

TP Compatibility

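Tensor parallelism requires both the attention-head count and the KV-head count to be divisible by the TP size: num_attention_heads % TP == 0 and num_kv_heads % TP == 0.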

Model	TP=2	TP=4
1.5B (12 heads, 2 kv)	Yes	No
7B (28 heads, 4 kv)	Yes	Yes
Qwen3-0.6B (16 heads, 8 kv)	Yes	Yes
Phi-4-mini (32 heads, 8 kv)	Yes	Yes
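The table follows directly from the divisibility rule. A minimal sketch that reproduces it, with the head counts copied from the table rather than read from model configs:

```python
def tp_compatible(num_attention_heads, num_kv_heads, tp_size):
    """Apply the rule stated above: both head counts must divide evenly."""
    return num_attention_heads % tp_size == 0 and num_kv_heads % tp_size == 0


# (attention heads, KV heads) as listed in the table above.
MODELS = {
    "1.5B": (12, 2),
    "7B": (28, 4),
    "Qwen3-0.6B": (16, 8),
    "Phi-4-mini": (32, 8),
}

for name, (heads, kv_heads) in MODELS.items():
    row = {tp: tp_compatible(heads, kv_heads, tp) for tp in (2, 4)}
    print(name, row)
```

The only failing combination is the 1.5B model at TP=4, where 2 KV heads cannot be split across 4 GPUs under this rule.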

All Logs

All experiment logs are preserved in logs/.

Citation

@article{xu2025dts,
  title={DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching},
  author={Xu, Zicheng and Wang, Guanchu and Chuang, Yu-Neng and Zheng, Guangyao and Szalay, Alexander S and Liu, Zirui and Braverman, Vladimir},
  journal={arXiv preprint arXiv:2511.00640},
  year={2025}
}

License

MIT License. See LICENSE.
