rapidtrees exposes six functions from its Rust core via PyO3. All accept a
Python iterator of newick strings, which lets the library stream through
arbitrarily large tree files without materialising all strings in memory at
once.
| Function | Returns |
|---|---|
pairwise_rf_from_newick_iter |
(names, bytes) — RF matrix as flat uint32 bytes, row-major |
pairwise_rf_with_snapshots_from_newick_iter |
(names, bytes, leaf_names, n_bip, bytes, bytes) — RF matrix + bipartition presence matrix + bipartition clade bitmasks |
pairwise_wrf_from_newick_iter |
(names, list[float]) — Weighted RF, flat row-major |
pairwise_wrf_with_snapshots_from_newick_iter |
(names, bytes, leaf_names, n_bip, bytes, bytes) — wRF matrix + branch-length matrix + bipartition clade bitmasks |
pairwise_kf_from_newick_iter |
(names, list[float]) — Kuhner-Felsenstein, flat row-major |
pairwise_kf_with_snapshots_from_newick_iter |
(names, bytes, leaf_names, n_bip, bytes, bytes) — KF matrix + branch-length matrix + bipartition clade bitmasks |
All six share the same call signature:
func(
names, # list[str] — one identifier per tree
newick_iter, # Iterator[str] — newick strings, consumed once left to right
translate_maps, # list[dict[str, str]] — taxon-ID → name mappings (BEAST TRANSLATE block)
map_indices, # list[int] — which translate map applies to each tree
rooted=False, # bool — compare clades (True) or bipartitions (False)
progress=None, # ProgressCounter | None — see "Progress reporting" below
)translate_maps and map_indices handle BEAST numeric taxon IDs. If your
newick strings already use real taxon names, pass [{}] and [0] * n.
To display a progress bar (or any kind of UI update) while a pairwise call
runs, instantiate a rapidtrees.ProgressCounter, hand it to the function,
and read it from another Python thread while the call blocks.
import threading, time, rapidtrees
from tqdm import tqdm
counter = rapidtrees.ProgressCounter()
t = threading.Thread(
target=rapidtrees.pairwise_rf_from_newick_iter,
args=(names, iter(newicks), translate_maps, map_indices),
kwargs={"progress": counter},
)
t.start()
with tqdm(total=1.0, unit="frac", bar_format="{l_bar}{bar}|{n:.2f}/{total}") as bar:
while t.is_alive():
bar.n = counter.fraction()
bar.refresh()
time.sleep(0.1)
bar.n = 1.0
bar.refresh()
t.join()ProgressCounter methods (all lock-free atomic reads):
| Method | Returns |
|---|---|
.value() |
int — pairs completed so far |
.total() |
int — n*(n-1)/2 (0 before any call has started) |
.fraction() |
float in [0.0, 1.0], clamped |
.reset() |
clears both value and total to 0 |
Notes:
progress=None(the default) holds the GIL throughout the call and adds zero overhead — identical to the pre-0.6.0behaviour.- When a counter is supplied the GIL is released for the duration of the rayon loop, so a polling Python thread runs unimpeded. Rust never calls back into Python.
- The counter ends at exactly
value == total(fraction() == 1.0) after the function returns.
| Function | Return type | How to decode |
|---|---|---|
pairwise_rf_from_newick_iter |
bytes — flat uint32, row-major |
np.frombuffer(b, dtype=np.uint32).reshape(n, n) |
pairwise_wrf_from_newick_iter |
list[float] — flat, row-major |
np.array(lst, dtype=np.float64).reshape(n, n) |
pairwise_kf_from_newick_iter |
list[float] — flat, row-major |
np.array(lst, dtype=np.float64).reshape(n, n) |
pairwise_rf_with_snapshots_from_newick_iter |
6-tuple — see below | see below |
pairwise_wrf_with_snapshots_from_newick_iter |
6-tuple — see below | see below |
pairwise_kf_with_snapshots_from_newick_iter |
6-tuple — see below | see below |
All functions raise ValueError when:
- Fewer than 2 trees are provided
len(names) != len(map_indices)- A
map_indicesvalue is out of bounds - Trees have different leaf sets
import rapidtrees as rtd
import numpy as np
trees = [
"(A:0.1,(B:0.1,C:0.1):0.1);",
"(A:0.1,(C:0.1,B:0.1):0.1);",
"((A:0.1,B:0.1):0.1,C:0.1);",
]
names = ["t1", "t2", "t3"]
# RF — raw uint32 bytes
tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
names, iter(trees), [{}], [0, 0, 0]
)
n = len(tree_names)
rf = np.frombuffer(rf_bytes, dtype=np.uint32).reshape(n, n)
# Weighted RF — flat list of floats
tree_names, wrf_flat = rtd.pairwise_wrf_from_newick_iter(
names, iter(trees), [{}], [0, 0, 0]
)
wrf = np.array(wrf_flat, dtype=np.float64).reshape(n, n)
# Kuhner-Felsenstein
tree_names, kf_flat = rtd.pairwise_kf_from_newick_iter(
names, iter(trees), [{}], [0, 0, 0]
)
kf = np.array(kf_flat, dtype=np.float64).reshape(n, n)The Rust API does not expose file reading — parse your .trees file in Python
and feed the newick strings to the iterator API. The translate maps carry the
BEAST TRANSLATE block so that numeric taxon IDs are resolved to real names.
import re
from pathlib import Path
import rapidtrees as rtd
import numpy as np
def load_beast(path, burnin_trees=0, use_real_taxa=True):
"""Parse a BEAST .trees file. Returns (translate_map, [(name, newick)])."""
content = Path(path).read_text()
base = Path(path).stem
translate = {}
if use_real_taxa:
in_block = False
for line in content.splitlines():
s = line.strip()
if s.upper().startswith("TRANSLATE"):
in_block = True
continue
if in_block:
if s.startswith(";"):
break
parts = s.rstrip(",").split()
if len(parts) >= 2:
translate[parts[0]] = parts[1].strip("'")
pairs = []
for idx, line in enumerate(content.splitlines()):
upper = line.upper().lstrip()
if upper.startswith("TREE ") and " = " in line:
header, newick = line.split(" = ", 1)
m = re.search(r"STATE_(\d+)", header, re.IGNORECASE)
name = f"{base}_{header.split()[-1]}" if m else f"{base}_tree_{idx}"
newick = re.sub(r"\[&[^\]]*\]", "", newick.strip()) # strip BEAST annotations
pairs.append((name, newick))
return translate, pairs[burnin_trees:]
tmap, tree_pairs = load_beast("run1.trees", burnin_trees=100)
names, newicks = zip(*tree_pairs)
tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
list(names), iter(newicks), [tmap], [0] * len(names)
)
n = len(tree_names)
rf = np.frombuffer(rf_bytes, dtype=np.uint32).reshape(n, n)When combining trees from several files, give each file its own translate map
and use map_indices to say which map applies to each tree:
all_names, all_newicks, translate_maps, map_indices = [], [], [], []
for path in ["run1.trees", "run2.trees"]:
tmap, pairs = load_beast(path, burnin_trees=100)
idx = len(translate_maps)
translate_maps.append(tmap)
for name, newick in pairs:
all_names.append(name)
all_newicks.append(newick)
map_indices.append(idx)
tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
all_names, iter(all_newicks), translate_maps, map_indices
)pairwise_rf_with_snapshots_from_newick_iter builds both the RF distance
matrix and the bipartition presence matrix in a single parse, returning a
6-tuple:
(tree_names, rf_bytes, leaf_names, n_bip, presence_bytes, bipartition_clade_bytes)
| Field | Type | Description |
|---|---|---|
tree_names |
list[str] |
Tree identifiers (same order as input names) |
rf_bytes |
bytes |
Flat uint32 RF matrix, row-major, shape (n, n) |
leaf_names |
list[str] |
Sorted taxon names — index i corresponds to bit i in every bipartition |
n_bip |
int |
Number of unique edges across all trees (internal bipartitions + pendant edges) |
presence_bytes |
bytes |
Flat uint8 presence matrix, row-major, shape (n, n_bip) |
bipartition_clade_bytes |
bytes |
Packed bitmasks, shape (n_bip, ceil(n_leaves/8)) — see below |
The presence matrix entry presence[i, j] is 1 if edge j appears in tree
i, otherwise 0. Column order is deterministic (ascending Bitset order)
and stable across calls on the same tree set.
n_bip counts all edges: pendant (leaf) edges and internal bipartitions.
Pendant edges appear as single-bit rows (one bit set), sorted before internal
bipartitions. Since all trees share the same leaf set, pendant columns are always
1 in every row of the presence matrix.
For internal bipartitions (rows with ≥ 2 bits set) the canonical side is the
half that does not contain the first leaf alphabetically, so bit 0 is never
set in those rows. Pendant edges are stored verbatim (no flip), so the pendant
of the first leaf has bit 0 set. The complement of an internal bipartition can
be derived as ~bip_bool[j] & True (masked to n_leaves bits).
bipartition_clade_bytes is a flat bytes buffer of shape
(n_bip, ceil(n_leaves / 8)). Each row encodes the canonical leaf membership
of one bipartition as a little-endian bitmask: bit i within each row is
1 if leaf_names[i] is on that bipartition's canonical side.
Decode with NumPy:
import math
import numpy as np
bytes_per_bip = math.ceil(len(leaf_names) / 8)
bip_arr = np.frombuffer(bipartition_clade_bytes, dtype=np.uint8).reshape(n_bip, bytes_per_bip)
bip_bool = np.unpackbits(bip_arr, axis=1, bitorder='little')[:, :len(leaf_names)]
# bip_bool[j, i] == 1 → leaf_names[i] is in bipartition jtree_names, rf_bytes, leaf_names, n_bip, pres_bytes, bip_clade_bytes = (
rtd.pairwise_rf_with_snapshots_from_newick_iter(
list(names), iter(newicks), [tmap], [0] * len(names)
)
)
n = len(tree_names)
rf = np.frombuffer(rf_bytes, dtype=np.uint32).reshape(n, n)
presence = np.frombuffer(pres_bytes, dtype=np.uint8 ).reshape(n, n_bip).copy()
# Verify: sum(|row_i − row_j|) == RF(i, j) for all pairs
for i in range(n):
for j in range(n):
assert int(np.sum(np.abs(presence[i].astype(int) - presence[j].astype(int)))) == int(rf[i, j])
# Global split frequencies across all trees (useful for Pseudo-ESS / ASDSF)
split_freq = presence.mean(axis=0)Decode bipartition_clade_bytes to build human-readable column labels for the
presence matrix, enabling downstream analyses such as tanglegrams, identifying
unstable splits, or computing per-clade frequencies:
import math
import pandas as pd
bytes_per_bip = math.ceil(len(leaf_names) / 8)
bip_arr = np.frombuffer(bip_clade_bytes, dtype=np.uint8).reshape(n_bip, bytes_per_bip)
bip_bool = np.unpackbits(bip_arr, axis=1, bitorder='little')[:, :len(leaf_names)]
# Build a human-readable column label for each bipartition
col_labels = [
"|".join(n for i, n in enumerate(leaf_names) if bip_bool[j, i])
for j in range(n_bip)
]
df = pd.DataFrame(presence, index=tree_names, columns=col_labels)
# e.g. col "C|D|E" == 1 means the split {C,D,E}|rest is present in that treepairwise_wrf_with_snapshots_from_newick_iter builds both the pairwise wRF
distance matrix and a per-edge branch-length matrix in a single parse,
returning a 6-tuple:
(tree_names, wrf_bytes, leaf_names, n_bip, branch_length_bytes, bipartition_clade_bytes)
| Field | Type | Description |
|---|---|---|
tree_names |
list[str] |
Tree identifiers (same order as input names) |
wrf_bytes |
bytes |
Flat float64 wRF matrix, row-major, shape (n, n) |
leaf_names |
list[str] |
Sorted taxon names — index i corresponds to bit i in every bipartition |
n_bip |
int |
Number of unique edges across all trees (pendant + internal) |
branch_length_bytes |
bytes |
Flat float64, shape (n_trees, n_bip), row-major |
bipartition_clade_bytes |
bytes |
Packed bitmasks, shape (n_bip, ceil(n_leaves/8)) — identical to RF snapshot |
branch_length_bytes[i, j] is the branch length of edge j in tree i, or
0.0 if that edge is absent. Pendant (leaf-edge) columns are always non-zero
because every tree has every leaf.
Column order matches bipartition_clade_bytes (ascending Bitset order,
deterministic and stable across calls on the same tree set).
import rapidtrees as rtd
import numpy as np
tree_names, wrf_bytes, leaf_names, n_bip, bl_bytes, bip_clade_bytes = (
rtd.pairwise_wrf_with_snapshots_from_newick_iter(
list(names), iter(newicks), [tmap], [0] * len(names)
)
)
n = len(tree_names)
# Actual wRF distance matrix (trees x trees)
wrf = np.frombuffer(wrf_bytes, dtype=np.float64).reshape(n, n)
# Branch length matrix (trees x bipartitions)
bl = np.frombuffer(bl_bytes, dtype=np.float64).reshape(n, n_bip)
# We can recompute distances relative to one reference tree using the branch-length matrix:
ref_idx = 0
wrf_trace = np.sum(np.abs(bl[ref_idx, :] - bl), axis=1) # shape (n,)
kf_trace = np.sqrt(np.sum((bl[ref_idx, :] - bl) ** 2, axis=1)) # shape (n,)
# Verify L1 identity: sum(|bl[i,:] - bl[j,:]|) == wrf[i,j]
for i in range(n):
for j in range(n):
assert abs(float(np.sum(np.abs(bl[i] - bl[j]))) - wrf[i, j]) < 1e-9pairwise_kf_with_snapshots_from_newick_iter is identical to the wRF variant
except the distance matrix uses the Kuhner–Felsenstein (Branch Score) metric.
The branch-length matrix is metric-agnostic and bit-for-bit identical across
both functions for the same input.
(tree_names, kf_bytes, leaf_names, n_bip, branch_length_bytes, bipartition_clade_bytes)
| Field | Type | Description |
|---|---|---|
tree_names |
list[str] |
Tree identifiers |
kf_bytes |
bytes |
Flat float64 KF matrix, row-major, shape (n, n) |
leaf_names |
list[str] |
Sorted taxon names |
n_bip |
int |
Number of unique edges |
branch_length_bytes |
bytes |
Flat float64, shape (n_trees, n_bip) — same as wRF variant |
bipartition_clade_bytes |
bytes |
Packed bitmasks — same as RF/wRF snapshot |
tree_names, kf_bytes, leaf_names, n_bip, bl_bytes, bip_clade_bytes = (
rtd.pairwise_kf_with_snapshots_from_newick_iter(
list(names), iter(newicks), [tmap], [0] * len(names)
)
)
n = len(tree_names)
# Actual KF distance matrix (trees x trees)
kf = np.frombuffer(kf_bytes, dtype=np.float64).reshape(n, n)
# Branch length matrix (trees x bipartitions)
bl = np.frombuffer(bl_bytes, dtype=np.float64).reshape(n, n_bip)
# L2 identity: sqrt(sum((bl[i,:]-bl[j,:])**2)) == kf[i,j]