dev: kdist<->divergence calibration harness (KDIST_CUTOFF study) by cjfields · Pull Request #59 · HPCBio/dada2-rs

cjfields · 2026-06-21T20:54:41Z

Summary

Adds an ignored analysis test (src/kdist_calibrate.rs) that empirically re-derives the relationship between DADA2's k-mer distance (kmer_dist8, our port of the ESPRIT metric) and true alignment divergence (align_endsfree), so KDIST_CUTOFF (0.42, nominally ~10% divergence, calibrated on Illumina 16S) can be checked per dataset / platform / pooling regime.

Motivation: the k-mer distance definition traces to ESPRIT (Sun et al. 2009, reference implementation no longer available) and the 0.42/10% calibration to DADA2 (Callahan et al. 2016) — thinly documented. The Rosen 2012 DADA paper uses ESPRIT's kmerdist as a black box. So measuring the relationship directly is the only reliable way to know whether 0.42 still makes sense outside its original Illumina-16S regime.

What it does

Reads a derep JSON (uniques[].sequence) or seqtab JSON (sequences[] + counts), samples unique-sequence pairs (random-subsample above --maxpairs to bound the O(n²) cost), and emits per-pair: kdist, edits, core_len, pct_div, screened_in, ab_i, ab_j. Summary reports screened-in fraction and leakage (screened-in but >LEAK_PCT% divergent — too far to be an error copy). Pooling mode = which sequence set you feed it (per-sample derep vs pooled table).

First Illumina V4 result (sam1F, 449 uniques, k=5)

kdist bin	median divergence
0.00–0.10	1.2%
0.30–0.40	10.0%
0.40–0.42	14.2%

kdist=0.42 ≈ 14.6% divergence here, not 10% (~10% sits at kdist≈0.29) — the cutoff is measurably looser than its nominal calibration.
All ≤3%-divergent pairs are screened in (the screen is safe in the error-copy regime — 0 fall above 0.42).
~55% of screened-in pairs are >5% divergent — the wasted-alignment leakage, now quantified.

Notes

Test-only module (#[cfg(test)] mod), no shipping-code or CLI surface; same idiom as the bench/wfa_diagnose harnesses.
kdist↔%div is length- and k-dependent; long-read (PacBio) mapping is the next experiment.

Test plan

DADA2RS_KDIST_INPUT=derep.json cargo test --release --bins kdist_calibrate -- --ignored --nocapture → writes kdist.csv + summary. Validated on a sam1F derep.

🤖 Generated with Claude Code

Add an ignored analysis test (src/kdist_calibrate.rs) that empirically re-derives the relationship between DADA2's k-mer distance (kmer_dist8, our port of the ESPRIT metric) and true alignment divergence (align_endsfree), so KDIST_CUTOFF (0.42, nominally ~10% divergence, calibrated on Illumina 16S) can be checked per dataset / platform / pooling regime. The ESPRIT reference implementation is gone and the 0.42/10% calibration is thinly documented, so measuring it directly is the only reliable check. Reads a derep JSON (uniques[].sequence) or a seqtab JSON (sequences[] + counts), samples unique-sequence pairs (random-subsample above --maxpairs to bound the O(n^2) cost), and emits per-pair kdist, edit count, aligned-core length, % divergence, screened-in flag, and the two abundances. Summary reports the screened-in fraction and the leakage (screened-in but >LEAK_PCT% divergent — too far to be an error copy). Pooling mode = which sequence set you feed it. First Illumina V4 result (sam1F, 449 uniques, k=5): kdist=0.42 corresponds to ~14.6% divergence here (not 10%; ~10% sits at kdist~0.29); all <=3%-divergent pairs are screened in (screen is safe in the error-copy regime); ~55% of screened-in pairs are >5% divergent (the wasted-alignment leakage). Test-only module (#[cfg(test)] mod), no shipping-code or CLI surface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… threaded) Turn the ignored-test harness into a hidden `kdist-calibrate` subcommand so it can be run across datasets without a test invocation. Changes vs the test: - Hidden CLI command (#[command(hide=true)]) with real flags: --k, --cutoff, --leak-pct, --band, --max-pairs, --max-uniques, --per-sample, --threads, --seed, -o/--output, --verbose. - Multiple derep JSON inputs (gzip-transparent, derep-only). Default pools all uniques (full-pool regime); --per-sample computes pairs within each sample (independent regime) and tags rows by sample. Pseudo == per-sample for the screen population. - Subsampling: --max-pairs (random pairs per population) and --max-uniques (random uniques per sample) bound the O(n^2) cost. - The NW alignments (the cost) are parallelised across --threads via rayon with per-thread AlignBuffers reuse. Still unbanded by default (--band -1) so the divergence of distant pairs isn't truncated. Reproduces the prior harness numbers exactly (sam1F k=5: 81.0% screened-in, 44555 leakage). CSV: sample,kdist,edits,core_len,pct_div,screened_in,ab_i,ab_j. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add --nearest-parent: instead of random pairs, link each unique to its nearest MORE-abundant neighbour (its candidate error-copy "parent", mirroring DADA2's greedy center-based comparison) and measure how far that link sits below the 0.42 cutoff. The distribution of parent-link kdist is the empirical error-copy distance ceiling; cutoff − ceiling is the screen's HEADROOM (how much 0.42 could be tightened without losing real error copies). Output columns: sample,ab,parent_ab,ab_ratio,kdist,edits,core_len,pct_div, screened_in. Verbose summary reports median/p90 parent-link kdist, % within cutoff, the clear-error-copy ceiling (max kdist among <=3%-divergent links), and the implied headroom. Cheaper than pairs mode: O(n^2) cheap kdist to find the parent + O(n) alignments (vs O(n^2) alignments), so it scales to pooled/multisample far better. First results: Illumina sam1F k=5 ceiling 0.127 -> headroom 0.293 (0.42 is ~3x looser than needed here); PacBio samPB k=7 ceiling 0.111 -> headroom 0.309. A per-sample lower bound on safe tightening — validate across samples before retuning. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ically) Add a band_req column (both modes) = the minimum diagonal band that reproduces each pair's true unbanded alignment (max |#seq1 − #seq2 bases consumed| along the path). A banded aligner with band < band_req truncates that alignment to a worse score, so band_req is what the band must cover to align the pair correctly. --verbose now prints a band-fit sweep: for candidate bands [2,4,8,16,32,64,128], the fraction of alignments that fit (all pairs + screened-in in pairs mode; clear-error-copy links in nearest-parent mode). Lets the band be derived per platform instead of inheriting the default 16 without justification. First results: Illumina sam1F error-copies need band <=1 (all pairs <=3) — 16 is ~5-16x over-provisioned; PacBio samPB k=7 error-copies reach band_req 10 (CCS homopolymer indels) so 16 covers 100% with margin, band 8 misses ~1%. The default 16 is worst-case (long-read) sized and over-applied to short reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Document the hidden, experimental kdist-calibrate subcommand under a new Diagnostics nav section. Covers: the experimental/hidden warning, what it measures (KDIST_CUTOFF screen + BAND_SIZE), invocation/flags, the two modes (all-pairs, abundance-aware --nearest-parent) and their CSV columns, the --verbose summaries (leakage, headroom, band-fit), pandas/CSV processing recipes, preliminary outcomes (0.42≈14.6% on Illumina, k=5 long-read screen saturation, headroom, band-fit), and prominent caveats — chiefly the tiny single-sample datasets behind the preliminary numbers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…iagnostic tools here at a later point

cjfields and others added 6 commits June 21, 2026 15:53

Update the wording slightly, and a small reformat; we may add other d…

52eb2ad

…iagnostic tools here at a later point

cjfields merged commit 690a06f into main Jun 21, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

dev: kdist<->divergence calibration harness (KDIST_CUTOFF study)#59

dev: kdist<->divergence calibration harness (KDIST_CUTOFF study)#59
cjfields merged 6 commits into
mainfrom
dev/kdist-calibrate

cjfields commented Jun 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

cjfields commented Jun 21, 2026

Summary

What it does

First Illumina V4 result (sam1F, 449 uniques, k=5)

Notes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant