Skip to content

Latest commit

 

History

History
418 lines (325 loc) · 15 KB

File metadata and controls

418 lines (325 loc) · 15 KB

Python API reference

rapidtrees exposes six functions from its Rust core via PyO3. All accept a Python iterator of newick strings, which lets the library stream through arbitrarily large tree files without materialising all strings in memory at once.


Functions

Function Returns
pairwise_rf_from_newick_iter (names, bytes) — RF matrix as flat uint32 bytes, row-major
pairwise_rf_with_snapshots_from_newick_iter (names, bytes, leaf_names, n_bip, bytes, bytes) — RF matrix + bipartition presence matrix + bipartition clade bitmasks
pairwise_wrf_from_newick_iter (names, list[float]) — Weighted RF, flat row-major
pairwise_wrf_with_snapshots_from_newick_iter (names, bytes, leaf_names, n_bip, bytes, bytes) — wRF matrix + branch-length matrix + bipartition clade bitmasks
pairwise_kf_from_newick_iter (names, list[float]) — Kuhner-Felsenstein, flat row-major
pairwise_kf_with_snapshots_from_newick_iter (names, bytes, leaf_names, n_bip, bytes, bytes) — KF matrix + branch-length matrix + bipartition clade bitmasks

All six share the same call signature:

func(
    names,            # list[str]            — one identifier per tree
    newick_iter,      # Iterator[str]        — newick strings, consumed once left to right
    translate_maps,   # list[dict[str, str]] — taxon-ID → name mappings (BEAST TRANSLATE block)
    map_indices,      # list[int]            — which translate map applies to each tree
    rooted=False,     # bool                 — compare clades (True) or bipartitions (False)
    progress=None,    # ProgressCounter | None — see "Progress reporting" below
)

translate_maps and map_indices handle BEAST numeric taxon IDs. If your newick strings already use real taxon names, pass [{}] and [0] * n.

Progress reporting

To display a progress bar (or any kind of UI update) while a pairwise call runs, instantiate a rapidtrees.ProgressCounter, hand it to the function, and read it from another Python thread while the call blocks.

import threading, time, rapidtrees
from tqdm import tqdm

counter = rapidtrees.ProgressCounter()

t = threading.Thread(
    target=rapidtrees.pairwise_rf_from_newick_iter,
    args=(names, iter(newicks), translate_maps, map_indices),
    kwargs={"progress": counter},
)
t.start()

with tqdm(total=1.0, unit="frac", bar_format="{l_bar}{bar}|{n:.2f}/{total}") as bar:
    while t.is_alive():
        bar.n = counter.fraction()
        bar.refresh()
        time.sleep(0.1)
    bar.n = 1.0
    bar.refresh()

t.join()

ProgressCounter methods (all lock-free atomic reads):

Method Returns
.value() int — pairs completed so far
.total() intn*(n-1)/2 (0 before any call has started)
.fraction() float in [0.0, 1.0], clamped
.reset() clears both value and total to 0

Notes:

  • progress=None (the default) holds the GIL throughout the call and adds zero overhead — identical to the pre-0.6.0 behaviour.
  • When a counter is supplied the GIL is released for the duration of the rayon loop, so a polling Python thread runs unimpeded. Rust never calls back into Python.
  • The counter ends at exactly value == total (fraction() == 1.0) after the function returns.

Return types

Function Return type How to decode
pairwise_rf_from_newick_iter bytes — flat uint32, row-major np.frombuffer(b, dtype=np.uint32).reshape(n, n)
pairwise_wrf_from_newick_iter list[float] — flat, row-major np.array(lst, dtype=np.float64).reshape(n, n)
pairwise_kf_from_newick_iter list[float] — flat, row-major np.array(lst, dtype=np.float64).reshape(n, n)
pairwise_rf_with_snapshots_from_newick_iter 6-tuple — see below see below
pairwise_wrf_with_snapshots_from_newick_iter 6-tuple — see below see below
pairwise_kf_with_snapshots_from_newick_iter 6-tuple — see below see below

Errors raised

All functions raise ValueError when:

  • Fewer than 2 trees are provided
  • len(names) != len(map_indices)
  • A map_indices value is out of bounds
  • Trees have different leaf sets

Examples

Newick strings (no file I/O)

import rapidtrees as rtd
import numpy as np

trees = [
    "(A:0.1,(B:0.1,C:0.1):0.1);",
    "(A:0.1,(C:0.1,B:0.1):0.1);",
    "((A:0.1,B:0.1):0.1,C:0.1);",
]
names = ["t1", "t2", "t3"]

# RF — raw uint32 bytes
tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
    names, iter(trees), [{}], [0, 0, 0]
)
n = len(tree_names)
rf = np.frombuffer(rf_bytes, dtype=np.uint32).reshape(n, n)

# Weighted RF — flat list of floats
tree_names, wrf_flat = rtd.pairwise_wrf_from_newick_iter(
    names, iter(trees), [{}], [0, 0, 0]
)
wrf = np.array(wrf_flat, dtype=np.float64).reshape(n, n)

# Kuhner-Felsenstein
tree_names, kf_flat = rtd.pairwise_kf_from_newick_iter(
    names, iter(trees), [{}], [0, 0, 0]
)
kf = np.array(kf_flat, dtype=np.float64).reshape(n, n)

BEAST .trees files

The Rust API does not expose file reading — parse your .trees file in Python and feed the newick strings to the iterator API. The translate maps carry the BEAST TRANSLATE block so that numeric taxon IDs are resolved to real names.

import re
from pathlib import Path
import rapidtrees as rtd
import numpy as np


def load_beast(path, burnin_trees=0, use_real_taxa=True):
    """Parse a BEAST .trees file. Returns (translate_map, [(name, newick)])."""
    content = Path(path).read_text()
    base = Path(path).stem

    translate = {}
    if use_real_taxa:
        in_block = False
        for line in content.splitlines():
            s = line.strip()
            if s.upper().startswith("TRANSLATE"):
                in_block = True
                continue
            if in_block:
                if s.startswith(";"):
                    break
                parts = s.rstrip(",").split()
                if len(parts) >= 2:
                    translate[parts[0]] = parts[1].strip("'")

    pairs = []
    for idx, line in enumerate(content.splitlines()):
        upper = line.upper().lstrip()
        if upper.startswith("TREE ") and " = " in line:
            header, newick = line.split(" = ", 1)
            m = re.search(r"STATE_(\d+)", header, re.IGNORECASE)
            name = f"{base}_{header.split()[-1]}" if m else f"{base}_tree_{idx}"
            newick = re.sub(r"\[&[^\]]*\]", "", newick.strip())  # strip BEAST annotations
            pairs.append((name, newick))

    return translate, pairs[burnin_trees:]


tmap, tree_pairs = load_beast("run1.trees", burnin_trees=100)
names, newicks = zip(*tree_pairs)

tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
    list(names), iter(newicks), [tmap], [0] * len(names)
)
n = len(tree_names)
rf = np.frombuffer(rf_bytes, dtype=np.uint32).reshape(n, n)

Multiple files

When combining trees from several files, give each file its own translate map and use map_indices to say which map applies to each tree:

all_names, all_newicks, translate_maps, map_indices = [], [], [], []

for path in ["run1.trees", "run2.trees"]:
    tmap, pairs = load_beast(path, burnin_trees=100)
    idx = len(translate_maps)
    translate_maps.append(tmap)
    for name, newick in pairs:
        all_names.append(name)
        all_newicks.append(newick)
        map_indices.append(idx)

tree_names, rf_bytes = rtd.pairwise_rf_from_newick_iter(
    all_names, iter(all_newicks), translate_maps, map_indices
)

RF + bipartition snapshot in one pass

pairwise_rf_with_snapshots_from_newick_iter builds both the RF distance matrix and the bipartition presence matrix in a single parse, returning a 6-tuple:

(tree_names, rf_bytes, leaf_names, n_bip, presence_bytes, bipartition_clade_bytes)
Field Type Description
tree_names list[str] Tree identifiers (same order as input names)
rf_bytes bytes Flat uint32 RF matrix, row-major, shape (n, n)
leaf_names list[str] Sorted taxon names — index i corresponds to bit i in every bipartition
n_bip int Number of unique edges across all trees (internal bipartitions + pendant edges)
presence_bytes bytes Flat uint8 presence matrix, row-major, shape (n, n_bip)
bipartition_clade_bytes bytes Packed bitmasks, shape (n_bip, ceil(n_leaves/8)) — see below

The presence matrix entry presence[i, j] is 1 if edge j appears in tree i, otherwise 0. Column order is deterministic (ascending Bitset order) and stable across calls on the same tree set.

Edge table contents

n_bip counts all edges: pendant (leaf) edges and internal bipartitions. Pendant edges appear as single-bit rows (one bit set), sorted before internal bipartitions. Since all trees share the same leaf set, pendant columns are always 1 in every row of the presence matrix.

Canonicalisation note

For internal bipartitions (rows with ≥ 2 bits set) the canonical side is the half that does not contain the first leaf alphabetically, so bit 0 is never set in those rows. Pendant edges are stored verbatim (no flip), so the pendant of the first leaf has bit 0 set. The complement of an internal bipartition can be derived as ~bip_bool[j] & True (masked to n_leaves bits).

Bipartition clade bytes format

bipartition_clade_bytes is a flat bytes buffer of shape (n_bip, ceil(n_leaves / 8)). Each row encodes the canonical leaf membership of one bipartition as a little-endian bitmask: bit i within each row is 1 if leaf_names[i] is on that bipartition's canonical side.

Decode with NumPy:

import math
import numpy as np

bytes_per_bip = math.ceil(len(leaf_names) / 8)
bip_arr  = np.frombuffer(bipartition_clade_bytes, dtype=np.uint8).reshape(n_bip, bytes_per_bip)
bip_bool = np.unpackbits(bip_arr, axis=1, bitorder='little')[:, :len(leaf_names)]
# bip_bool[j, i] == 1  →  leaf_names[i] is in bipartition j
tree_names, rf_bytes, leaf_names, n_bip, pres_bytes, bip_clade_bytes = (
    rtd.pairwise_rf_with_snapshots_from_newick_iter(
        list(names), iter(newicks), [tmap], [0] * len(names)
    )
)
n = len(tree_names)
rf       = np.frombuffer(rf_bytes,   dtype=np.uint32).reshape(n, n)
presence = np.frombuffer(pres_bytes, dtype=np.uint8 ).reshape(n, n_bip).copy()

# Verify: sum(|row_i − row_j|) == RF(i, j) for all pairs
for i in range(n):
    for j in range(n):
        assert int(np.sum(np.abs(presence[i].astype(int) - presence[j].astype(int)))) == int(rf[i, j])

# Global split frequencies across all trees (useful for Pseudo-ESS / ASDSF)
split_freq = presence.mean(axis=0)

Named presence matrix (post-hoc analysis)

Decode bipartition_clade_bytes to build human-readable column labels for the presence matrix, enabling downstream analyses such as tanglegrams, identifying unstable splits, or computing per-clade frequencies:

import math
import pandas as pd

bytes_per_bip = math.ceil(len(leaf_names) / 8)
bip_arr  = np.frombuffer(bip_clade_bytes, dtype=np.uint8).reshape(n_bip, bytes_per_bip)
bip_bool = np.unpackbits(bip_arr, axis=1, bitorder='little')[:, :len(leaf_names)]

# Build a human-readable column label for each bipartition
col_labels = [
    "|".join(n for i, n in enumerate(leaf_names) if bip_bool[j, i])
    for j in range(n_bip)
]

df = pd.DataFrame(presence, index=tree_names, columns=col_labels)
# e.g.  col "C|D|E" == 1 means the split {C,D,E}|rest is present in that tree

wRF + branch-length matrix in one pass

pairwise_wrf_with_snapshots_from_newick_iter builds both the pairwise wRF distance matrix and a per-edge branch-length matrix in a single parse, returning a 6-tuple:

(tree_names, wrf_bytes, leaf_names, n_bip, branch_length_bytes, bipartition_clade_bytes)
Field Type Description
tree_names list[str] Tree identifiers (same order as input names)
wrf_bytes bytes Flat float64 wRF matrix, row-major, shape (n, n)
leaf_names list[str] Sorted taxon names — index i corresponds to bit i in every bipartition
n_bip int Number of unique edges across all trees (pendant + internal)
branch_length_bytes bytes Flat float64, shape (n_trees, n_bip), row-major
bipartition_clade_bytes bytes Packed bitmasks, shape (n_bip, ceil(n_leaves/8)) — identical to RF snapshot

branch_length_bytes[i, j] is the branch length of edge j in tree i, or 0.0 if that edge is absent. Pendant (leaf-edge) columns are always non-zero because every tree has every leaf.

Column order matches bipartition_clade_bytes (ascending Bitset order, deterministic and stable across calls on the same tree set).

Decode and compute Fréchet ESS traces

import rapidtrees as rtd
import numpy as np

tree_names, wrf_bytes, leaf_names, n_bip, bl_bytes, bip_clade_bytes = (
    rtd.pairwise_wrf_with_snapshots_from_newick_iter(
        list(names), iter(newicks), [tmap], [0] * len(names)
    )
)
n = len(tree_names)
# Actual wRF distance matrix (trees x trees)
wrf = np.frombuffer(wrf_bytes, dtype=np.float64).reshape(n, n)

# Branch length matrix (trees x bipartitions)
bl  = np.frombuffer(bl_bytes,  dtype=np.float64).reshape(n, n_bip)

# We can recompute distances relative to one reference tree using the branch-length matrix:
ref_idx = 0
wrf_trace = np.sum(np.abs(bl[ref_idx, :] - bl), axis=1)           # shape (n,)
kf_trace  = np.sqrt(np.sum((bl[ref_idx, :] - bl) ** 2, axis=1))   # shape (n,)

# Verify L1 identity: sum(|bl[i,:] - bl[j,:]|) == wrf[i,j]
for i in range(n):
    for j in range(n):
        assert abs(float(np.sum(np.abs(bl[i] - bl[j]))) - wrf[i, j]) < 1e-9

KF + branch-length matrix in one pass

pairwise_kf_with_snapshots_from_newick_iter is identical to the wRF variant except the distance matrix uses the Kuhner–Felsenstein (Branch Score) metric. The branch-length matrix is metric-agnostic and bit-for-bit identical across both functions for the same input.

(tree_names, kf_bytes, leaf_names, n_bip, branch_length_bytes, bipartition_clade_bytes)
Field Type Description
tree_names list[str] Tree identifiers
kf_bytes bytes Flat float64 KF matrix, row-major, shape (n, n)
leaf_names list[str] Sorted taxon names
n_bip int Number of unique edges
branch_length_bytes bytes Flat float64, shape (n_trees, n_bip) — same as wRF variant
bipartition_clade_bytes bytes Packed bitmasks — same as RF/wRF snapshot
tree_names, kf_bytes, leaf_names, n_bip, bl_bytes, bip_clade_bytes = (
    rtd.pairwise_kf_with_snapshots_from_newick_iter(
        list(names), iter(newicks), [tmap], [0] * len(names)
    )
)
n = len(tree_names)

# Actual KF distance matrix (trees x trees)
kf = np.frombuffer(kf_bytes, dtype=np.float64).reshape(n, n)

# Branch length matrix (trees x bipartitions)
bl  = np.frombuffer(bl_bytes, dtype=np.float64).reshape(n, n_bip)

# L2 identity: sqrt(sum((bl[i,:]-bl[j,:])**2)) == kf[i,j]