DTS-vLLM

A vLLM-based implementation of Decoding Tree Sketching, an entropy-guided tree search decoding framework that improves reasoning accuracy of language models.

This repository migrates DTS from the original HuggingFace Transformers backend to vLLM, enabling efficient inference with PagedAttention and automatic prefix caching. It also supports multi-GPU acceleration via tensor parallelism.

Installation

git clone https://github.com/meierteng/DTS-vLLM.git
cd DTS-vLLM
pip install -e .

Requires Python >= 3.10, PyTorch >= 2.4, vLLM >= 0.8.

How DTS Works

DTS monitors the Shannon entropy (H) and varentropy (V, the variance of the token surprisal) of the next-token distribution during generation. When the model is confident overall but split between a few alternatives (low H, high V), DTS branches the decoding tree.
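As a rough illustration of the branching test, the sketch below computes H and V for a single next-token distribution and applies the thresholds listed under Parameters (`-e 2.5`, `-v 1.5`). The function names and the use of natural logarithms are assumptions made for this example; the repository's own code may compute these quantities differently, for instance from vLLM's truncated log-probabilities.

```python
import math


def entropy_and_varentropy(probs):
    """Return (H, V) in nats for one next-token probability distribution."""
    # Surprisal -log p for every token with non-zero probability.
    support = [(p, -math.log(p)) for p in probs if p > 0.0]
    h = sum(p * s for p, s in support)             # H = E[-log p]
    v = sum(p * (s - h) ** 2 for p, s in support)  # V = Var[-log p]
    return h, v


def should_branch(probs, entropy_threshold=2.5, varentropy_threshold=1.5):
    """Assumed rule: branch when H <= entropy_threshold and V > varentropy_threshold."""
    h, v = entropy_and_varentropy(probs)
    return h <= entropy_threshold and v > varentropy_threshold
```

Under these assumptions, a distribution that puts most of its mass on two tokens and spreads the rest thinly over a long tail has low H and high V and would trigger a branch, while a uniform distribution has V = 0 and would not.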

Phase 1 (Branching): Generate one token at a time and branch whenever the condition is met, until max_active_hyps hypotheses exist.

Phase 2 (Completion): Complete all hypotheses in a single batched generate() call. This phase accounts for more than 99% of wall time.
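The Phase 1 loop can be pictured with a toy sketch like the one below. It is not the repository's implementation: `branch_phase`, `next_dist`, and `max_steps` are hypothetical names, the model is replaced by a callable that returns a token-to-probability dict, non-branching steps take the argmax instead of sampling, and the frontier is allowed to overshoot `max_active_hyps` on the final step.

```python
import math
from typing import Callable, Dict, List


def entropy_varentropy(dist: Dict[str, float]):
    """Return (H, V) in nats for a token -> probability mapping."""
    terms = [(p, -math.log(p)) for p in dist.values() if p > 0.0]
    h = sum(p * s for p, s in terms)
    v = sum(p * (s - h) ** 2 for p, s in terms)
    return h, v


def branch_phase(
    prompt: List[str],
    next_dist: Callable[[List[str]], Dict[str, float]],  # hypothetical model stub
    top_k: int = 3,
    max_active_hyps: int = 48,
    entropy_threshold: float = 2.5,
    varentropy_threshold: float = 1.5,
    max_steps: int = 64,  # safety bound, an assumption of this sketch
) -> List[List[str]]:
    """Grow hypotheses one token at a time until the freeze threshold is hit."""
    hyps = [list(prompt)]
    for _ in range(max_steps):
        if len(hyps) >= max_active_hyps:
            break  # freeze: hand the hypotheses over to Phase 2
        grown = []
        for hyp in hyps:
            dist = next_dist(hyp)
            ranked = sorted(dist, key=dist.get, reverse=True)
            h, v = entropy_varentropy(dist)
            branch = h <= entropy_threshold and v > varentropy_threshold
            width = top_k if branch else 1  # top-k children, or one continuation
            for tok in ranked[:width]:
                grown.append(hyp + [tok])
        hyps = grown
    return hyps
```

In this picture, Phase 2 would pass every returned hypothesis as a prompt to one batched generate() call, which is where shared prefixes can benefit from prefix caching.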

Two decoding modes:

  • Greedy (--num_traces 1): Return the first trace to finish.
  • Stable (--num_traces 8): Majority vote across multiple traces.
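The Stable mode's majority vote can be sketched as follows. This is a minimal illustration, assuming each finished trace has already been reduced to an extracted final answer (or `None` when no answer could be parsed); the `majority_vote` name and the tie-breaking rule are assumptions, not the repository's documented behaviour.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent non-empty answer, or None if there is none."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    # Counter.most_common orders equal counts by first occurrence,
    # so ties go to the answer that appeared first.
    return Counter(valid).most_common(1)[0][0]
```

For example, with eight traces whose extracted answers are `["204", "204", "113", None, "204", "113", "7", "204"]`, the vote selects `"204"`.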

Supported Models and Datasets

| Model Key | Model |
|---|---|
| 1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| 7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| phi-4-mini-reasoning | microsoft/Phi-4-mini-reasoning |
| qwen30p6 | Qwen/Qwen3-0.6B |

| Dataset Key | Dataset |
|---|---|
| aime24 | AIME 2024 (30 problems) |
| aime25 | AIME 2025 (30 problems) |
| gpqa_diamond | GPQA Diamond |
| livebench_reasoning | LiveBench Reasoning |

Usage

Single GPU

# DTS-Greedy
python run_benchmark.py dts \
    --model_name 1.5B --dataset_name aime24 \
    -e 2.5 -k 3 -a 48 -m 32768 -t 0.6 \
    -s 0 -n 5 --num_traces 1 --enforce_eager

# DTS-Stable (majority vote)
python run_benchmark.py dts \
    --model_name 1.5B --dataset_name aime24 \
    -e 2.5 -k 3 -a 48 -m 32768 -t 0.6 \
    -s 0 -n 5 --num_traces 8 --enforce_eager

# Standard autoregressive baseline
python run_benchmark.py standard \
    --model_name 1.5B --dataset_name aime24 \
    -t 0.6 -s 0 -n 5

Multi-GPU (Tensor Parallelism)

python run_benchmark.py dts \
    --model_name 7B --dataset_name aime25 -t 0.5 \
    -e 2.5 -k 3 -a 48 -m 32768 -s 0 -n 1 --num_traces 1 --enforce_eager \
    --tensor_parallel_size 2

For multi-GPU runs, set these environment variables:

export VLLM_SKIP_P2P_CHECK=1
export NCCL_IGNORE_DISABLED_P2P=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export VLLM_WORKER_MULTIPROC_METHOD=spawn

Parameters

| Parameter | Flag | Default | Description |
|---|---|---|---|
| Entropy threshold | -e | 2.5 | Branch when H <= this value |
| Varentropy threshold | -v | 1.5 | Branch when V > this value |
| Branch top-k | -k | 3 | Children per branch point |
| Max active hypotheses | -a | 48 | Freeze threshold |
| Max new tokens | -m | 32768 | Per-hypothesis token limit |
| Temperature | -t | 0.6 | Sampling temperature |
| Num traces | --num_traces | 8 | 1 for Greedy, 8 for Stable |
| Tensor parallel size | --tensor_parallel_size | 1 | Number of GPUs |
| Trials | -n | 1 | Number of seeds to run |
| Seed | -s | 0 | Starting random seed |

Results

Unless a table row states otherwise, experiments use DTS-Greedy (num_traces=1). All runs: A100-SXM4-80GB, vLLM 0.8.5.post1, enforce_eager=True, 5 seeds (0-4).

DeepSeek-R1-Distill-Qwen-1.5B — AIME 2024 (temp=0.6)

| Method | Our Accuracy | Paper |
|---|---|---|
| Standard | 28.67% | 26.67% |
| DTS-Greedy (1 trace) | 53.33% | 54.67% |
| DTS-Stable (8 traces) | 67.33% | 64.67% |

DeepSeek-R1-Distill-Qwen-1.5B — AIME 2025 (temp=0.5)

| Method | Our Accuracy | Paper |
|---|---|---|
| DTS-Greedy (1 trace) | 30.00% | 34.67% |

DeepSeek-R1-Distill-Qwen-7B — AIME 2025 (temp=0.5): TP=1 vs TP=2

| Config | Seed 0 | Seed 1 | Seed 2 | Seed 3 | Seed 4 | Mean Acc | Mean Time |
|---|---|---|---|---|---|---|---|
| TP=2 (2×A100) | 50.00% | 50.00% | 53.33% | 53.33% | 50.00% | 51.33% | 12920s (3.6h) |
| TP=1 (1×A100) | 60.00% | 56.67% | 46.67% | 56.67% | 46.67% | 53.33% | 25369s (7.0h) |

TP=2 achieves ~1.96x speedup over TP=1 with comparable accuracy (within variance).
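The summary figures can be re-derived from the table with a few lines of Python. The per-seed correct counts below are inferred from the reported percentages and the 30-problem size of AIME 2025 (for example, 50.00% of 30 is 15); they are an assumption of this sketch, not values read from the logs.

```python
from statistics import mean, stdev

NUM_PROBLEMS = 30  # AIME 2025 size, as listed under Supported Models and Datasets

# Correct answers per seed, inferred from the per-seed percentages (assumption).
tp2_correct = [15, 15, 16, 16, 15]
tp1_correct = [18, 17, 14, 17, 14]
tp2_time_s, tp1_time_s = 12920, 25369  # mean wall time per run, from the table


def to_pct(correct):
    """Convert correct-answer counts into accuracy percentages."""
    return [100.0 * c / NUM_PROBLEMS for c in correct]


tp2_acc, tp1_acc = to_pct(tp2_correct), to_pct(tp1_correct)
tp2_mean, tp1_mean = mean(tp2_acc), mean(tp1_acc)
speedup = tp1_time_s / tp2_time_s

print(f"TP=2 mean accuracy: {tp2_mean:.2f}% (stdev {stdev(tp2_acc):.2f})")
print(f"TP=1 mean accuracy: {tp1_mean:.2f}% (stdev {stdev(tp1_acc):.2f})")
print(f"Speedup TP=2 over TP=1: {speedup:.2f}x")
```

The gap between the two means is smaller than the seed-to-seed standard deviation of the TP=1 runs, which is the sense in which the accuracies are comparable.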

TP Compatibility

TP requires num_attention_heads % TP == 0 and num_kv_heads % TP == 0.

| Model | TP=2 | TP=4 |
|---|---|---|
| 1.5B (12 heads, 2 kv) | Yes | No |
| 7B (28 heads, 4 kv) | Yes | Yes |
| Qwen3-0.6B (16 heads, 8 kv) | Yes | Yes |
| Phi-4-mini (32 heads, 8 kv) | Yes | Yes |
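The divisibility rule above is easy to check before launching a run. The helper below is a small sketch; `tp_compatible` is a name chosen for this example, and the head counts are copied from the table rather than read from the model configs.

```python
def tp_compatible(num_attention_heads, num_kv_heads, tp_size):
    """True when both head counts divide evenly across tp_size GPUs."""
    return num_attention_heads % tp_size == 0 and num_kv_heads % tp_size == 0


# (attention heads, KV heads) as listed in the table above.
MODELS = {
    "1.5B": (12, 2),
    "7B": (28, 4),
    "Qwen3-0.6B": (16, 8),
    "Phi-4-mini": (32, 8),
}

for name, (heads, kv_heads) in MODELS.items():
    support = {tp: tp_compatible(heads, kv_heads, tp) for tp in (2, 4)}
    print(name, support)
```

For the 1.5B model, TP=4 fails only because of the KV heads: 12 attention heads divide by 4, but 2 KV heads do not.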

All Logs

All experiment logs are preserved in logs/.

Citation

@article{xu2025dts,
  title={DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching},
  author={Xu, Zicheng and Wang, Guanchu and Chuang, Yu-Neng and Zheng, Guangyao and Szalay, Alexander S and Liu, Zirui and Braverman, Vladimir},
  journal={arXiv preprint arXiv:2511.00640},
  year={2025}
}

License

MIT License. See LICENSE.
