dev: kdist<->divergence calibration harness (KDIST_CUTOFF study)#59
Merged
Conversation
Add an ignored analysis test (src/kdist_calibrate.rs) that empirically re-derives the relationship between DADA2's k-mer distance (kmer_dist8, our port of the ESPRIT metric) and true alignment divergence (align_endsfree), so KDIST_CUTOFF (0.42, nominally ~10% divergence, calibrated on Illumina 16S) can be checked per dataset / platform / pooling regime. The ESPRIT reference implementation is gone and the 0.42/10% calibration is thinly documented, so measuring it directly is the only reliable check. Reads a derep JSON (uniques[].sequence) or a seqtab JSON (sequences[] + counts), samples unique-sequence pairs (random-subsample above --maxpairs to bound the O(n^2) cost), and emits per-pair kdist, edit count, aligned-core length, % divergence, screened-in flag, and the two abundances. Summary reports the screened-in fraction and the leakage (screened-in but >LEAK_PCT% divergent — too far to be an error copy). Pooling mode = which sequence set you feed it. First Illumina V4 result (sam1F, 449 uniques, k=5): kdist=0.42 corresponds to ~14.6% divergence here (not 10%; ~10% sits at kdist~0.29); all <=3%-divergent pairs are screened in (screen is safe in the error-copy regime); ~55% of screened-in pairs are >5% divergent (the wasted-alignment leakage). Test-only module (#[cfg(test)] mod), no shipping-code or CLI surface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… threaded) Turn the ignored-test harness into a hidden `kdist-calibrate` subcommand so it can be run across datasets without a test invocation. Changes vs the test: - Hidden CLI command (#[command(hide=true)]) with real flags: --k, --cutoff, --leak-pct, --band, --max-pairs, --max-uniques, --per-sample, --threads, --seed, -o/--output, --verbose. - Multiple derep JSON inputs (gzip-transparent, derep-only). Default pools all uniques (full-pool regime); --per-sample computes pairs within each sample (independent regime) and tags rows by sample. Pseudo == per-sample for the screen population. - Subsampling: --max-pairs (random pairs per population) and --max-uniques (random uniques per sample) bound the O(n^2) cost. - The NW alignments (the cost) are parallelised across --threads via rayon with per-thread AlignBuffers reuse. Still unbanded by default (--band -1) so the divergence of distant pairs isn't truncated. Reproduces the prior harness numbers exactly (sam1F k=5: 81.0% screened-in, 44555 leakage). CSV: sample,kdist,edits,core_len,pct_div,screened_in,ab_i,ab_j. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add --nearest-parent: instead of random pairs, link each unique to its nearest MORE-abundant neighbour (its candidate error-copy "parent", mirroring DADA2's greedy center-based comparison) and measure how far that link sits below the 0.42 cutoff. The distribution of parent-link kdist is the empirical error-copy distance ceiling; cutoff − ceiling is the screen's HEADROOM (how much 0.42 could be tightened without losing real error copies). Output columns: sample,ab,parent_ab,ab_ratio,kdist,edits,core_len,pct_div, screened_in. Verbose summary reports median/p90 parent-link kdist, % within cutoff, the clear-error-copy ceiling (max kdist among <=3%-divergent links), and the implied headroom. Cheaper than pairs mode: O(n^2) cheap kdist to find the parent + O(n) alignments (vs O(n^2) alignments), so it scales to pooled/multisample far better. First results: Illumina sam1F k=5 ceiling 0.127 -> headroom 0.293 (0.42 is ~3x looser than needed here); PacBio samPB k=7 ceiling 0.111 -> headroom 0.309. A per-sample lower bound on safe tightening — validate across samples before retuning. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ically) Add a band_req column (both modes) = the minimum diagonal band that reproduces each pair's true unbanded alignment (max |#seq1 − #seq2 bases consumed| along the path). A banded aligner with band < band_req truncates that alignment to a worse score, so band_req is what the band must cover to align the pair correctly. --verbose now prints a band-fit sweep: for candidate bands [2,4,8,16,32,64,128], the fraction of alignments that fit (all pairs + screened-in in pairs mode; clear-error-copy links in nearest-parent mode). Lets the band be derived per platform instead of inheriting the default 16 without justification. First results: Illumina sam1F error-copies need band <=1 (all pairs <=3) — 16 is ~5-16x over-provisioned; PacBio samPB k=7 error-copies reach band_req 10 (CCS homopolymer indels) so 16 covers 100% with margin, band 8 misses ~1%. The default 16 is worst-case (long-read) sized and over-applied to short reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Document the hidden, experimental kdist-calibrate subcommand under a new Diagnostics nav section. Covers: the experimental/hidden warning, what it measures (KDIST_CUTOFF screen + BAND_SIZE), invocation/flags, the two modes (all-pairs, abundance-aware --nearest-parent) and their CSV columns, the --verbose summaries (leakage, headroom, band-fit), pandas/CSV processing recipes, preliminary outcomes (0.42≈14.6% on Illumina, k=5 long-read screen saturation, headroom, band-fit), and prominent caveats — chiefly the tiny single-sample datasets behind the preliminary numbers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…iagnostic tools here at a later point
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an ignored analysis test (
src/kdist_calibrate.rs) that empirically re-derives the relationship between DADA2's k-mer distance (kmer_dist8, our port of the ESPRIT metric) and true alignment divergence (align_endsfree), soKDIST_CUTOFF(0.42, nominally ~10% divergence, calibrated on Illumina 16S) can be checked per dataset / platform / pooling regime.Motivation: the k-mer distance definition traces to ESPRIT (Sun et al. 2009, reference implementation no longer available) and the 0.42/10% calibration to DADA2 (Callahan et al. 2016) — thinly documented. The Rosen 2012 DADA paper uses ESPRIT's
kmerdistas a black box. So measuring the relationship directly is the only reliable way to know whether 0.42 still makes sense outside its original Illumina-16S regime.What it does
Reads a derep JSON (
uniques[].sequence) or seqtab JSON (sequences[]+counts), samples unique-sequence pairs (random-subsample above--maxpairsto bound the O(n²) cost), and emits per-pair:kdist, edits, core_len, pct_div, screened_in, ab_i, ab_j. Summary reports screened-in fraction and leakage (screened-in but >LEAK_PCT% divergent — too far to be an error copy). Pooling mode = which sequence set you feed it (per-sample derep vs pooled table).First Illumina V4 result (sam1F, 449 uniques, k=5)
Notes
#[cfg(test)] mod), no shipping-code or CLI surface; same idiom as the bench/wfa_diagnoseharnesses.Test plan
DADA2RS_KDIST_INPUT=derep.json cargo test --release --bins kdist_calibrate -- --ignored --nocapture→ writeskdist.csv+ summary. Validated on a sam1F derep.🤖 Generated with Claude Code