Skip to content

dev: kdist<->divergence calibration harness (KDIST_CUTOFF study)#59

Merged
cjfields merged 6 commits into
mainfrom
dev/kdist-calibrate
Jun 21, 2026
Merged

dev: kdist<->divergence calibration harness (KDIST_CUTOFF study)#59
cjfields merged 6 commits into
mainfrom
dev/kdist-calibrate

Conversation

@cjfields

Copy link
Copy Markdown
Member

Summary

Adds an ignored analysis test (src/kdist_calibrate.rs) that empirically re-derives the relationship between DADA2's k-mer distance (kmer_dist8, our port of the ESPRIT metric) and true alignment divergence (align_endsfree), so KDIST_CUTOFF (0.42, nominally ~10% divergence, calibrated on Illumina 16S) can be checked per dataset / platform / pooling regime.

Motivation: the k-mer distance definition traces to ESPRIT (Sun et al. 2009, reference implementation no longer available) and the 0.42/10% calibration to DADA2 (Callahan et al. 2016) — thinly documented. The Rosen 2012 DADA paper uses ESPRIT's kmerdist as a black box. So measuring the relationship directly is the only reliable way to know whether 0.42 still makes sense outside its original Illumina-16S regime.

What it does

Reads a derep JSON (uniques[].sequence) or seqtab JSON (sequences[] + counts), samples unique-sequence pairs (random-subsample above --maxpairs to bound the O(n²) cost), and emits per-pair: kdist, edits, core_len, pct_div, screened_in, ab_i, ab_j. Summary reports screened-in fraction and leakage (screened-in but >LEAK_PCT% divergent — too far to be an error copy). Pooling mode = which sequence set you feed it (per-sample derep vs pooled table).

First Illumina V4 result (sam1F, 449 uniques, k=5)

kdist bin median divergence
0.00–0.10 1.2%
0.30–0.40 10.0%
0.40–0.42 14.2%
  • kdist=0.42 ≈ 14.6% divergence here, not 10% (~10% sits at kdist≈0.29) — the cutoff is measurably looser than its nominal calibration.
  • All ≤3%-divergent pairs are screened in (the screen is safe in the error-copy regime — 0 fall above 0.42).
  • ~55% of screened-in pairs are >5% divergent — the wasted-alignment leakage, now quantified.

Notes

  • Test-only module (#[cfg(test)] mod), no shipping-code or CLI surface; same idiom as the bench/wfa_diagnose harnesses.
  • kdist↔%div is length- and k-dependent; long-read (PacBio) mapping is the next experiment.

Test plan

DADA2RS_KDIST_INPUT=derep.json cargo test --release --bins kdist_calibrate -- --ignored --nocapture → writes kdist.csv + summary. Validated on a sam1F derep.

🤖 Generated with Claude Code

cjfields and others added 6 commits June 21, 2026 15:53
Add an ignored analysis test (src/kdist_calibrate.rs) that empirically re-derives
the relationship between DADA2's k-mer distance (kmer_dist8, our port of the
ESPRIT metric) and true alignment divergence (align_endsfree), so KDIST_CUTOFF
(0.42, nominally ~10% divergence, calibrated on Illumina 16S) can be checked per
dataset / platform / pooling regime. The ESPRIT reference implementation is gone
and the 0.42/10% calibration is thinly documented, so measuring it directly is
the only reliable check.

Reads a derep JSON (uniques[].sequence) or a seqtab JSON (sequences[] + counts),
samples unique-sequence pairs (random-subsample above --maxpairs to bound the
O(n^2) cost), and emits per-pair kdist, edit count, aligned-core length, %
divergence, screened-in flag, and the two abundances. Summary reports the
screened-in fraction and the leakage (screened-in but >LEAK_PCT% divergent — too
far to be an error copy). Pooling mode = which sequence set you feed it.

First Illumina V4 result (sam1F, 449 uniques, k=5): kdist=0.42 corresponds to
~14.6% divergence here (not 10%; ~10% sits at kdist~0.29); all <=3%-divergent
pairs are screened in (screen is safe in the error-copy regime); ~55% of
screened-in pairs are >5% divergent (the wasted-alignment leakage).

Test-only module (#[cfg(test)] mod), no shipping-code or CLI surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… threaded)

Turn the ignored-test harness into a hidden `kdist-calibrate` subcommand so it
can be run across datasets without a test invocation. Changes vs the test:

- Hidden CLI command (#[command(hide=true)]) with real flags: --k, --cutoff,
  --leak-pct, --band, --max-pairs, --max-uniques, --per-sample, --threads,
  --seed, -o/--output, --verbose.
- Multiple derep JSON inputs (gzip-transparent, derep-only). Default pools all
  uniques (full-pool regime); --per-sample computes pairs within each sample
  (independent regime) and tags rows by sample. Pseudo == per-sample for the
  screen population.
- Subsampling: --max-pairs (random pairs per population) and --max-uniques
  (random uniques per sample) bound the O(n^2) cost.
- The NW alignments (the cost) are parallelised across --threads via rayon with
  per-thread AlignBuffers reuse. Still unbanded by default (--band -1) so the
  divergence of distant pairs isn't truncated.

Reproduces the prior harness numbers exactly (sam1F k=5: 81.0% screened-in,
44555 leakage). CSV: sample,kdist,edits,core_len,pct_div,screened_in,ab_i,ab_j.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add --nearest-parent: instead of random pairs, link each unique to its nearest
MORE-abundant neighbour (its candidate error-copy "parent", mirroring DADA2's
greedy center-based comparison) and measure how far that link sits below the
0.42 cutoff. The distribution of parent-link kdist is the empirical error-copy
distance ceiling; cutoff − ceiling is the screen's HEADROOM (how much 0.42 could
be tightened without losing real error copies).

Output columns: sample,ab,parent_ab,ab_ratio,kdist,edits,core_len,pct_div,
screened_in. Verbose summary reports median/p90 parent-link kdist, % within
cutoff, the clear-error-copy ceiling (max kdist among <=3%-divergent links), and
the implied headroom.

Cheaper than pairs mode: O(n^2) cheap kdist to find the parent + O(n) alignments
(vs O(n^2) alignments), so it scales to pooled/multisample far better.

First results: Illumina sam1F k=5 ceiling 0.127 -> headroom 0.293 (0.42 is ~3x
looser than needed here); PacBio samPB k=7 ceiling 0.111 -> headroom 0.309. A
per-sample lower bound on safe tightening — validate across samples before
retuning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ically)

Add a band_req column (both modes) = the minimum diagonal band that reproduces
each pair's true unbanded alignment (max |#seq1 − #seq2 bases consumed| along
the path). A banded aligner with band < band_req truncates that alignment to a
worse score, so band_req is what the band must cover to align the pair correctly.

--verbose now prints a band-fit sweep: for candidate bands [2,4,8,16,32,64,128],
the fraction of alignments that fit (all pairs + screened-in in pairs mode;
clear-error-copy links in nearest-parent mode). Lets the band be derived per
platform instead of inheriting the default 16 without justification.

First results: Illumina sam1F error-copies need band <=1 (all pairs <=3) — 16 is
~5-16x over-provisioned; PacBio samPB k=7 error-copies reach band_req 10 (CCS
homopolymer indels) so 16 covers 100% with margin, band 8 misses ~1%. The
default 16 is worst-case (long-read) sized and over-applied to short reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the hidden, experimental kdist-calibrate subcommand under a new
Diagnostics nav section. Covers: the experimental/hidden warning, what it
measures (KDIST_CUTOFF screen + BAND_SIZE), invocation/flags, the two modes
(all-pairs, abundance-aware --nearest-parent) and their CSV columns, the
--verbose summaries (leakage, headroom, band-fit), pandas/CSV processing
recipes, preliminary outcomes (0.42≈14.6% on Illumina, k=5 long-read screen
saturation, headroom, band-fit), and prominent caveats — chiefly the tiny
single-sample datasets behind the preliminary numbers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cjfields cjfields merged commit 690a06f into main Jun 21, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant