Skip to content

feat(ncbivirus): native local filters + Filtered completeness; fix multi-page total reconciliation#57

Merged
001TMF merged 2 commits into
mainfrom
feat/ncbivirus-local-filters
Jun 15, 2026
Merged

feat(ncbivirus): native local filters + Filtered completeness; fix multi-page total reconciliation#57
001TMF merged 2 commits into
mainfrom
feat/ncbivirus-local-filters

Conversation

@001TMF

@001TMF 001TMF commented Jun 15, 2026

Copy link
Copy Markdown
Owner

What & why

Makes the NCBI Virus connector run the exact kind of query VirBench's hard cases exercise — sequence-length, collection-date range, ambiguous-character ceiling, and lab-passaged exclusion — natively and deterministically, instead of only partially (post-hoc filtering outside the engine).

The NCBI Datasets v2 API cannot evaluate these server-side (the dataset_report exposes length, collection_date, isolate.name, and a sequence_hash, but no ambiguous-base count and no server filter for these fields). So the connector applies them as local predicates over the COMPLETE, count-reconciled candidate set and reports a new filtered completeness whose authoritative_count is the exact match count (a whole exhaustive scan, not a sample). This is strictly stronger than gget virus (which downloads + filters locally with no reconciliation).

New FilteredState completeness

  • engine/contract: FilteredState + NewFiltered(matched, returned, reason) + IsExact(). ReconciledCount = records returned; AuthoritativeCount = exact match count (they differ only when an output limit truncates the returned records — so search --limit 1 still reports the true total).
  • A filtered result is served live (deterministic, re-derivable by re-running) — it is not content-addressed as a reproducible snapshot, so the "snapshot ⇒ reconciled-complete" invariant Verify trusts is untouched.

Connector

  • sequence_length (gte/lte), collection_date (gte/lte, interval-overlap semantics for partial dates), max_ambiguous_chars (lte), exclude_lab_passaged (eq).
  • Multi-operator-per-field rule table (a two-sided range = two different operators on one field); duplicate (field, operator) rejected.
  • New efetch endpoint supplies the sequence for the ambiguous-base count only — it never enters a RawRecord or the logical hash (sequences are immutable per accession.version, so the count is reproducible). Cheap report-only predicates run first; sequences are fetched only for survivors.

Bug fixes surfaced by this work

  • Multi-page reconciliation (latent, affects the base connector): NCBI returns x-ncbi-total-count only on the first page of a token walk; continuation pages omit it. The walk previously read that absence as total=0 / a mid-walk mutation and downgraded every >1000-record query to best-effort. The page-0 total is now authoritative for the whole walk; only a present, differing total is a real mutation. (Prior multi-page fixtures set the total on every page, masking this — a new test reflects real NCBI behavior.)
  • Completeness serialization: CompletenessState now marshals as its string token (matching schema/manifest.schema.json) rather than the raw uint8 ordinal, with a backward-compatible reader. The manifest is not content-addressed, so no snapshot hashes change.

Review

Cross-model reviewed (codex/GPT-5.5 independent pass + an adversarial Sonnet pass) — both SHIP, no P0. Their P1/P2 findings are folded in: reject duplicate (field, op); per-month/leap-year day validation in normalizeDate; propagate the FASTA scanner error.

Validated live

pinakes search against live NCBI returns state: "filtered" with exact dual counts (include vs exclude proven), the efetch ambiguous-base path works, and the full ZEBOV human query (taxid 3052462, 2014 outbreak window) reconciles natively.

Tests

Full local-filter path (Filtered exact count; limit bounds output not count; efetch only for report-survivors; lab-passaged; unreconciled candidate → best-effort), the multi-page total-header behavior, predicate units, and the contract codec/round-trip — all proven-by-revert. make ci green; govulncheck clean.

  • case-studies/ebola: refreshed filter_logic_hash (the connector's published semantics changed — intended drift signal).

🤖 Generated with Claude Code

…lti-page total reconciliation

The NCBI Datasets v2 virus API cannot evaluate sequence-length, collection-date
range, ambiguous-character, or lab-passaged predicates server-side — exactly the
filters VirBench's hard queries exercise. Previously such a query could only be
run partially (post-hoc filtering outside the engine).

This adds them as deterministic LOCAL predicates: the connector scans the
COMPLETE, count-reconciled candidate set and applies the predicates exhaustively,
reporting a new FilteredState whose AuthoritativeCount is the EXACT match count
(computed over a whole reconciled scan, not sampled). The caller's limit caps only
the returned records, so `search --limit 1` still yields the exact total. The
ambiguous-character count uses the sequence from a new efetch endpoint (the
sequence never enters a RawRecord or the hash; it is immutable per
accession.version, so the count is reproducible). Cheap report-only predicates run
first; sequences are fetched only for survivors.

Also fixes a latent multi-page reconciliation bug surfaced by this work: NCBI
returns x-ncbi-total-count ONLY on the first page of a token walk; continuation
pages omit it (but keep x-datasets-version). The walk previously read that absence
as total=0 / a mid-walk mutation and downgraded EVERY >1000-record query to
best-effort. The page-0 total is now authoritative for the whole walk; only a
PRESENT, differing total signals a real mutation. (Prior multi-page fixtures set
the total on every page, masking this; a new test reflects real NCBI behavior.)

And corrects a latent contract bug: CompletenessState now serializes as its string
token (matching schema/manifest.schema.json) rather than the raw uint8 ordinal,
with a backward-compatible reader. The manifest is not content-addressed, so no
snapshot hash changes.

Review fixes (cross-model: codex/GPT-5.5 + adversarial): reject a repeated
(field, operator) pair (last-write-wins could relax a bound and corrupt the exact
count); validate calendar days per-month with leap years in normalizeDate; and
propagate the FASTA scanner error so a truncated read cannot undercount N.

- engine/contract: FilteredState + NewFiltered + IsExact + JSON token codec
- connectors/ncbivirus: sequence_length, collection_date (gte/lte),
  max_ambiguous_chars, exclude_lab_passaged; multi-operator-per-field rule table;
  efetch endpoint; first-page-total reconciliation fix
- schema: add "filtered" to the completeness state enum + reason rule
- case-studies/ebola: refresh filter_logic_hash (semantics changed; intended drift)

Tests cover the full local-filter path, the multi-page total-header behavior,
predicate units, and the contract codec — all proven-by-revert. make ci green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@001TMF 001TMF added the spine-change Authorizes edits to the frozen contract spine (engine/, idl/, schema) — CONTRACTS.md A.2 label Jun 15, 2026
Add an optional, off-the-frozen-Connector SequenceExporter capability so
`export --output-format fasta` materializes the verified set's sequences through
the engine instead of a side fetch. ncbivirus implements ExportFASTA via batched
efetch, emitting canonical deterministic FASTA (>ACCESSION + wrapped residues, in
canonical record order) under strict complete-or-fail: every requested accession
must come back exactly once or it errors. snapshot.Store.WriteExport writes the
artifact atomically under <root>/exports/ (key/ext path-traversal guarded);
resolver.MaterializeExport type-asserts the capability for sequence formats and
keeps the legacy locator for the rest. ctx threaded for cancellation.

Cross-model reviewed (codex/GPT-5.5): no P0; all findings folded in (set
verification, canonical bytes, ctx, ext validation, empty-before-efetch). 20 new
tests, key guarantees proven by revert. Live-validated: the exact ZEBOV query
exports 261 records, byte-identical residues to the reference set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@001TMF 001TMF merged commit 72270f3 into main Jun 15, 2026
3 checks passed
@001TMF 001TMF deleted the feat/ncbivirus-local-filters branch June 15, 2026 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

spine-change Authorizes edits to the frozen contract spine (engine/, idl/, schema) — CONTRACTS.md A.2

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant