Blazingly fast pairwise phylogenetic tree distance calculations β RobinsonβFoulds, Weighted RF, and KuhnerβFelsenstein β powered by Rust
Overview β’ Installing β’ Usage β’ Python API β’ Snap Format β’ Benchmarks
rapidtrees computes pairwise tree distances from BEAST/NEXUS .trees files or from precomputed .snap files and writes a labeled distance matrix. Three metrics are supported:
| Metric | Flag | Output | Description |
|---|---|---|---|
| RobinsonβFoulds | --metric rf |
integer | Symmetric difference of bipartitions |
| Weighted RF | --metric weighted |
float | Branch-length-weighted bipartition difference |
| KuhnerβFelsenstein | --metric kf |
float | Euclidean distance on branch lengths |
- π¦ Rust core β zero-overhead bitset operations with a cache-friendly memory layout
- π Parallel by default β powered by
rayon, automatically scales across all cores - π Python bindings β drop into any Python/NumPy workflow via
PyO3 - π¦ No Rust toolchain required β pre-built wheels on PyPI for Linux, macOS, and Windows
- ποΈ Gzip output β stream directly to
.tsv.gzwithout a separate compression step
Benchmarked on a ZIKA dataset (283 taxa Β· 4 000 trees Β· ~8 M comparisons) (non-gzipped output):
| Metric | Total time | Throughput |
|---|---|---|
| Robinson-Foulds | ~2.7 s | ~3.0 M comparisons/sec |
| Weighted RF | ~3.4 s | ~2.3 M comparisons/sec |
| Kuhner-Felsenstein | ~3.3 s | ~2.3 M comparisons/sec |
Pre-built wheels for Linux, macOS, and Windows. No Rust toolchain needed.
pip install rapidtreesInstall the standalone command-line binary. Requires the Rust toolchain.
cargo install rapidtrees- Rust toolchain β for building the Rust core
- pixi β for managing Python and R dependencies
git clone https://github.com/Joon-Klaps/rapidtrees.git
cd rapidtrees
# Set up environment β installs Python, R (with phangorn), and builds the package
pixi install# Run Python API tests (includes R/phangorn cross-validation)
pixi run test-python
# Run Rust unit tests
pixi run test-rustrapidtrees \
(--input <path/to/file.trees> | --snap-input <path/to/file.snap>) \
--output <path/to/output.tsv[.gz]> \
[--burnin-trees <N>] \
[--burnin-states <STATE>] \
[--use-real-taxa] \
[--metric rf|weighted|kf] \
[-q|--quiet]| Flag | Description |
|---|---|
-i, --input <INPUT> |
Path to BEAST .trees (NEXUS) file |
--snap-input <SNAP_INPUT> |
Path to .snap file (currently supports only --metric rf) |
-o, --output <OUTPUT> |
Output path. Use .gz suffix for gzip compression; - for stdout |
-t, --burnin-trees <N> |
Drop the first N trees (default: 0) |
-s, --burnin-states <STATE> |
Keep only trees with STATE > STATE (default: 0) |
--use-real-taxa |
Map numeric taxon IDs via the TRANSLATE block |
--metric <rf|weighted|kf> |
Distance metric (default: rf) |
-q, --quiet |
Suppress progress messages (errors still go to stderr) |
The output is a square TSV matrix where both the header row and first column contain tree names formatted as <file_basename>_tree_STATE<state>. Use -o - to write to stdout for easy piping.
Compute RF matrix β gzipped file
rapidtrees \
-i tests/data/hiv1.trees \
-o out/hiv1_rf.tsv.gz \
--metric rf
# Reading in beast 0.003s
# Read in 162 taxons for 21 trees
# Creating tree bit snapshots 0.002s
# Determining distances using RF for 210 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000sApply burn-in by tree count
rapidtrees \
-i tests/data/hiv1.trees \
-o out/hiv1_rf.tsv \
-t 2
# Reading in beast 0.003s
# Read in 162 taxons for 19 trees
# Creating tree bit snapshots 0.001s
# Determining distances using RF for 171 combinations
# Determining distances using RF 0.000s
# Writing to output 0.000sCompute RF matrix directly from a snapshot file
rapidtrees \
--snap-input out/hiv1.snap \
-o out/hiv1_rf_from_snap.tsv.gz \
--metric rf
# Read snap with 21 trees and 2193 bipartitions in 0.001s
# Determined distances using RF in 0.000s
# Writing to out/hiv1_rf_from_snap.tsv.gz in 0.000srapidtrees exposes four functions from its Rust core. All accept a Python iterator of newick strings, keeping memory constant regardless of tree count.
| Function | Returns |
|---|---|
pairwise_rf_from_newick_iter |
(names, bytes) β RF matrix as flat uint32 bytes, row-major |
pairwise_rf_with_snapshots_from_newick_iter |
(names, bytes, leaf_names, n_bip, bytes, bytes) β RF matrix + presence matrix + clade bitmasks |
pairwise_wrf_from_newick_iter |
(names, list[float]) β Weighted RF, flat row-major |
pairwise_wrf_with_snapshots_from_newick_iter |
(names, bytes, leaf_names, n_bip, bytes, bytes) β wRF matrix + branch-length matrix + clade bitmasks |
pairwise_kf_from_newick_iter |
(names, list[float]) β Kuhner-Felsenstein, flat row-major |
pairwise_kf_with_snapshots_from_newick_iter |
(names, bytes, leaf_names, n_bip, bytes, bytes) β KF matrix + branch-length matrix + clade bitmasks |
import rapidtrees as rtd
import numpy as np
trees = [
"(A:0.1,(B:0.1,C:0.1):0.1);",
"(A:0.1,(C:0.1,B:0.1):0.1);",
"((A:0.1,B:0.1):0.1,C:0.1);",
]
names = ["t1", "t2", "t3"]
tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
names, iter(trees), [{}], [0, 0, 0]
)
rf = np.frombuffer(rf_bytes, dtype=np.uint32).reshape(len(tree_names), -1)For BEAST .trees files, translate maps, the snapshot API, and multi-file usage see docs/python-api.md.
rapidtrees can export tree snapshots to a compressed binary .snap file for downstream analyses (ESS computation, ASDSF, convergence diagnostics) on HPC clusters without re-parsing the original .trees files. The CLI can also compute RF distance matrices directly from .snap files via --snap-input.
A tree snapshot is a compact bitset representation of a phylogenetic tree. Each bipartition (split) is encoded as a bitset over leaf indices. The full set of snapshots for a tree collection captures everything needed for RF-family distance computations and convergence diagnostics β without storing the original Newick strings.
Note: Snapshots are not human-readable and are not intended for general interchange. They are an internal format optimized for fast distance calculations and cannot be convert to it's original newick-style format.
A .snap file is a gzip-compressed binary stream with the following sections in order:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β HEADER β
β 4 bytes magic "SNAP" (0x534E4150) β
β 1 byte version format version (currently 2) β
β 8 bytes n_trees u64 LE β number of trees β
β 8 bytes n_taxa u64 LE β number of leaf taxa β
β 8 bytes n_bip u64 LE β number of unique β
β bipartitions across all treesβ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β TAXA NAMES β
β For each of n_taxa: β
β 4 bytes length u32 LE β byte length of name β
β N bytes name UTF-8 string β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β TREE NAMES β
β For each of n_trees: β
β 4 bytes length u32 LE β byte length of name β
β N bytes name UTF-8 string β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β PRESENCE MATRIX β
β n_trees Γ n_bip bytes, row-major uint8 β
β presence[i][j] = 1 if bipartition j is in tree i β
β 0 otherwise β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Bipartition column order is deterministic: columns are sorted in ascending Bitset order (lexicographic over u64 words, i.e. by leaf-index bit pattern), so the same tree set always produces the same column indices regardless of parse order.
The presence matrix is sufficient for all major convergence diagnostics:
| Diagnostic | What you need |
|---|---|
| Pseudo ESS | presence.mean(axis=0) per chain |
| ASDSF | Per-chain split frequencies |
| FrΓ©chet ESS | RF distances via XOR of rows |
| WRF / KF distances | Branch lengths β use pairwise_wrf_with_snapshots_from_newick_iter or pairwise_kf_with_snapshots_from_newick_iter |
Note: sum(presence[i] XOR presence[j]) == RF(tree_i, tree_j) exactly.
Snap files are written and read by the CLI only. From Python, use
pairwise_rf_with_snapshots_from_newick_iter to obtain the same presence
matrix in memory without writing a file β see the Python API section above.
import rapidtrees as rtd
import numpy as np
import math
tree_names, rf_bytes, leaf_names, n_bip, pres_bytes, bip_clade_bytes = (
rtd.pairwise_rf_with_snapshots_from_newick_iter(
names, iter(newicks), translate_maps, map_indices
)
)
n = len(tree_names)
presence = np.frombuffer(pres_bytes, dtype=np.uint8).reshape(n, n_bip).copy()
# Global split frequencies (for Pseudo ESS / ASDSF)
global_freq = presence.mean(axis=0)
# RF distance between any two trees β no recomputation needed
rf_01 = int((presence[0].astype(int) ^ presence[1].astype(int)).sum())
# Named presence matrix β decode clade bitmasks for column labels
bytes_per_bip = math.ceil(len(leaf_names) / 8)
bip_arr = np.frombuffer(bip_clade_bytes, dtype=np.uint8).reshape(n_bip, bytes_per_bip)
bip_bool = np.unpackbits(bip_arr, axis=1, bitorder="little")[:, :len(leaf_names)]
col_labels = ["|".join(leaf_names[i] for i in np.where(bip_bool[j])[0]) for j in range(n_bip)]
import pandas as pd
df = pd.DataFrame(presence, index=tree_names, columns=col_labels)Benchmarks were run on a MacBook Pro M1. Trees are parsed once and bitset snapshots are reused across all pairwise comparisons. Parallelism is provided by rayon β no manual thread management needed.
Show full benchmark table
Output of
cargo bench --bench memory_time_benchmark
| Taxa (N) | Trees (T) | Combinations | Est. Memory | Actual Memory | Wall Time | CPU Time |
|---|---|---|---|---|---|---|
| 10 | 100 | 5.0K | 13.96 KB | 17.87 KB | 172.29 Β΅s | 733.00 Β΅s |
| 10 | 1000 | 500.0K | 139.65 KB | 168.95 KB | 5.94 ms | 12.99 ms |
| 10 | 10000 | 50.0M | 1.36 MB | 1.64 MB | 717.43 ms | 1.36 s |
| 10 | 100000 | 5.0B | 13.64 MB | 16.40 MB | 3.53 min | 4.04 min |
| 100 | 100 | 5.0K | 133.59 KB | 136.09 KB | 244.42 Β΅s | 1.75 ms |
| 100 | 1000 | 500.0K | 1.30 MB | 1.21 MB | 17.49 ms | 50.99 ms |
| 100 | 10000 | 50.0M | 13.05 MB | 11.95 MB | 1.08 s | 4.81 s |
| 100 | 100000 | 5.0B | 130.46 MB | 119.41 MB | 4.29 min | 9.86 min |
| 500 | 100 | 5.0K | 706.84 KB | 713.93 KB | 2.02 ms | 3.08 ms |
| 500 | 1000 | 500.0K | 6.90 MB | 5.89 MB | 27.77 ms | 216.87 ms |
| 500 | 10000 | 50.0M | 69.03 MB | 57.84 MB | 2.82 s | 21.30 s |
| 500 | 100000 | 5.0B | 690.27 MB | 577.28 MB | 7.44 min | 37.13 min |
| 1000 | 100 | 5.0K | 1.50 MB | 1.51 MB | 873.75 Β΅s | 5.63 ms |
| 1000 | 1000 | 500.0K | 15.03 MB | 11.86 MB | 51.22 ms | 420.94 ms |
| 1000 | 10000 | 50.0M | 150.32 MB | 115.30 MB | 5.12 s | 42.92 s |
| 1000 | 100000 | 5.0B | 1.47 GB | 1.12 GB | 11.26 min | 74.74 min |
| 2000 | 100 | 5.0K | 3.50 MB | 3.51 MB | 1.14 ms | 9.92 ms |
| 2000 | 1000 | 500.0K | 35.01 MB | 24.15 MB | 93.73 ms | 838.01 ms |
| 2000 | 10000 | 50.0M | 350.06 MB | 230.59 MB | 9.42 s | 1.38 min |
| 2000 | 100000 | 5.0B | 3.42 GB | 2.24 GB | 18.57 min | 141.11 min |
| 5000 | 100 | 5.0K | 14.30 MB | 12.43 MB | 61.36 ms | 83.85 ms |
| 5000 | 1000 | 500.0K | 143.02 MB | 63.97 MB | 224.40 ms | 2.09 s |
| 5000 | 10000 | 50.0M | 1.40 GB | 579.40 MB | 22.45 s | 3.44 min |
Note: Weighted RF and KF produce floating-point matrices; RF produces integer matrices.
- No trees parsed? Verify the input is a valid NEXUS
.treesfile and adjust--burnin-*settings. - Piping to other tools? Use
-qto suppress timing messages on stdout. - Gzipped output not working? Ensure the output filename ends with
.gz.
rapidtrees is provided under the MIT License.