feat(ncbivirus): native local filters + Filtered completeness; fix multi-page total reconciliation#57
Merged
Merged
Conversation
…lti-page total reconciliation The NCBI Datasets v2 virus API cannot evaluate sequence-length, collection-date range, ambiguous-character, or lab-passaged predicates server-side — exactly the filters VirBench's hard queries exercise. Previously such a query could only be run partially (post-hoc filtering outside the engine). This adds them as deterministic LOCAL predicates: the connector scans the COMPLETE, count-reconciled candidate set and applies the predicates exhaustively, reporting a new FilteredState whose AuthoritativeCount is the EXACT match count (computed over a whole reconciled scan, not sampled). The caller's limit caps only the returned records, so `search --limit 1` still yields the exact total. The ambiguous-character count uses the sequence from a new efetch endpoint (the sequence never enters a RawRecord or the hash; it is immutable per accession.version, so the count is reproducible). Cheap report-only predicates run first; sequences are fetched only for survivors. Also fixes a latent multi-page reconciliation bug surfaced by this work: NCBI returns x-ncbi-total-count ONLY on the first page of a token walk; continuation pages omit it (but keep x-datasets-version). The walk previously read that absence as total=0 / a mid-walk mutation and downgraded EVERY >1000-record query to best-effort. The page-0 total is now authoritative for the whole walk; only a PRESENT, differing total signals a real mutation. (Prior multi-page fixtures set the total on every page, masking this; a new test reflects real NCBI behavior.) And corrects a latent contract bug: CompletenessState now serializes as its string token (matching schema/manifest.schema.json) rather than the raw uint8 ordinal, with a backward-compatible reader. The manifest is not content-addressed, so no snapshot hash changes. Review fixes (cross-model: codex/GPT-5.5 + adversarial): reject a repeated (field, operator) pair (last-write-wins could relax a bound and corrupt the exact count); validate calendar days per-month with leap years in normalizeDate; and propagate the FASTA scanner error so a truncated read cannot undercount N. - engine/contract: FilteredState + NewFiltered + IsExact + JSON token codec - connectors/ncbivirus: sequence_length, collection_date (gte/lte), max_ambiguous_chars, exclude_lab_passaged; multi-operator-per-field rule table; efetch endpoint; first-page-total reconciliation fix - schema: add "filtered" to the completeness state enum + reason rule - case-studies/ebola: refresh filter_logic_hash (semantics changed; intended drift) Tests cover the full local-filter path, the multi-page total-header behavior, predicate units, and the contract codec — all proven-by-revert. make ci green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an optional, off-the-frozen-Connector SequenceExporter capability so `export --output-format fasta` materializes the verified set's sequences through the engine instead of a side fetch. ncbivirus implements ExportFASTA via batched efetch, emitting canonical deterministic FASTA (>ACCESSION + wrapped residues, in canonical record order) under strict complete-or-fail: every requested accession must come back exactly once or it errors. snapshot.Store.WriteExport writes the artifact atomically under <root>/exports/ (key/ext path-traversal guarded); resolver.MaterializeExport type-asserts the capability for sequence formats and keeps the legacy locator for the rest. ctx threaded for cancellation. Cross-model reviewed (codex/GPT-5.5): no P0; all findings folded in (set verification, canonical bytes, ctx, ext validation, empty-before-efetch). 20 new tests, key guarantees proven by revert. Live-validated: the exact ZEBOV query exports 261 records, byte-identical residues to the reference set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
Makes the NCBI Virus connector run the exact kind of query VirBench's hard cases exercise — sequence-length, collection-date range, ambiguous-character ceiling, and lab-passaged exclusion — natively and deterministically, instead of only partially (post-hoc filtering outside the engine).
The NCBI Datasets v2 API cannot evaluate these server-side (the
dataset_reportexposeslength,collection_date,isolate.name, and asequence_hash, but no ambiguous-base count and no server filter for these fields). So the connector applies them as local predicates over the COMPLETE, count-reconciled candidate set and reports a newfilteredcompleteness whoseauthoritative_countis the exact match count (a whole exhaustive scan, not a sample). This is strictly stronger thangget virus(which downloads + filters locally with no reconciliation).New
FilteredStatecompletenessengine/contract:FilteredState+NewFiltered(matched, returned, reason)+IsExact().ReconciledCount= records returned;AuthoritativeCount= exact match count (they differ only when an output limit truncates the returned records — sosearch --limit 1still reports the true total).filteredresult is served live (deterministic, re-derivable by re-running) — it is not content-addressed as a reproducible snapshot, so the "snapshot ⇒ reconciled-complete" invariant Verify trusts is untouched.Connector
sequence_length(gte/lte),collection_date(gte/lte, interval-overlap semantics for partial dates),max_ambiguous_chars(lte),exclude_lab_passaged(eq).(field, operator)rejected.efetchendpoint supplies the sequence for the ambiguous-base count only — it never enters aRawRecordor the logical hash (sequences are immutable peraccession.version, so the count is reproducible). Cheap report-only predicates run first; sequences are fetched only for survivors.Bug fixes surfaced by this work
x-ncbi-total-countonly on the first page of a token walk; continuation pages omit it. The walk previously read that absence astotal=0/ a mid-walk mutation and downgraded every >1000-record query to best-effort. The page-0 total is now authoritative for the whole walk; only a present, differing total is a real mutation. (Prior multi-page fixtures set the total on every page, masking this — a new test reflects real NCBI behavior.)CompletenessStatenow marshals as its string token (matchingschema/manifest.schema.json) rather than the rawuint8ordinal, with a backward-compatible reader. The manifest is not content-addressed, so no snapshot hashes change.Review
Cross-model reviewed (codex/GPT-5.5 independent pass + an adversarial Sonnet pass) — both SHIP, no P0. Their P1/P2 findings are folded in: reject duplicate
(field, op); per-month/leap-year day validation innormalizeDate; propagate the FASTA scanner error.Validated live
pinakes searchagainst live NCBI returnsstate: "filtered"with exact dual counts (include vs exclude proven), the efetch ambiguous-base path works, and the full ZEBOV human query (taxid 3052462, 2014 outbreak window) reconciles natively.Tests
Full local-filter path (Filtered exact count; limit bounds output not count; efetch only for report-survivors; lab-passaged; unreconciled candidate → best-effort), the multi-page total-header behavior, predicate units, and the contract codec/round-trip — all proven-by-revert.
make cigreen; govulncheck clean.case-studies/ebola: refreshedfilter_logic_hash(the connector's published semantics changed — intended drift signal).🤖 Generated with Claude Code