
fix(connectors): honor --limit + fix gzip/taxon-id/multi-word terms (uniprot, pdb, clinvar)#56

Open
001TMF wants to merge 1 commit into main from fix/benchmark-connectors-limit-gzip
001TMF (Owner) commented Jun 14, 2026

What

Three connectors (uniprot, pdb, clinvar) ignored Query.Limit and walked or fetched the full result set even for a count query (--limit 1), which caused timeouts. Each also had a connector-specific bug that made benchmark and count queries fail outright. Found while wiring these sources into the multi-domain retrieval benchmark.

All three now mirror the ncbi-virus limit-as-walk-ceiling pattern (#55). Complete-or-fail is preserved: the strict limit < authority guard makes a limited walk always BestEffort with AuthoritativeCount = upstream total, so a limit can never produce a false Complete.
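
As a rough illustration of the guard described above, here is a minimal Go sketch. The names `walkCeiling` and `classify` and the `Completeness` type are hypothetical (only `Query.Limit`, BestEffort, Complete, and AuthoritativeCount come from this PR), and it assumes a limit of 0 means "no limit". It is not the connectors' actual code.

```go
package main

import "fmt"

// Completeness is a hypothetical stand-in for the engine's completeness flag.
type Completeness int

const (
	Complete Completeness = iota
	BestEffort
)

// walkCeiling returns how many records a walk may visit: the limit when it
// is set and strictly below the upstream total, otherwise the total itself.
// Assumption: limit <= 0 means "no limit".
func walkCeiling(limit, authority int) int {
	if limit > 0 && limit < authority {
		return limit
	}
	return authority
}

// classify applies the strict `limit < authority` guard. A limited walk is
// always BestEffort; an unlimited walk is Complete only when it covered the
// whole upstream total, and fails otherwise (complete-or-fail).
func classify(walked, limit, authority int) (Completeness, error) {
	if limit > 0 && limit < authority {
		return BestEffort, nil
	}
	if walked != authority {
		return BestEffort, fmt.Errorf("walked %d of %d records", walked, authority)
	}
	return Complete, nil
}
```

Because the comparison is strict, a limit equal to or above the upstream total does not downgrade the result: the walk still has to cover every record to be reported as Complete.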

uniprot

  • gzip: readBody returned raw gzip bytes (UniProt gzips its JSON; it arrived undecompressed), so json.Decode failed with invalid character '\x1f'. Now decompresses on Content-Encoding: gzip or the 0x1f 0x8b magic sniff (covers both search and getEntry).
  • limit: honors Query.Limit as a walk ceiling; exposes the authoritative total on the BestEffort path.

pdb

  • organism_taxon_id → HTTP 400: mapped to rcsb_entity_source_organism.ncbi_taxonomy_id, which is search-disabled upstream. Remapped to rcsb_entity_source_organism.taxonomy_lineage.id (string in), which the RCSB Search API accepts.
  • limit: honors Query.Limit as a walk ceiling and caps getEntry to the limited id set (was fetching every entry of the full total even for --limit 1).
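
To make the two pdb changes concrete, here is a hedged Go sketch. The attribute name and the string `in` operator come from this PR; the terminal-node JSON shape follows the public RCSB Search API format as I understand it, and the type and function names (`terminalNode`, `taxonNode`, `capIDs`) are hypothetical.

```go
package main

import (
	"encoding/json"
	"strconv"
)

// taxonAttr is the searchable attribute this PR remaps organism_taxon_id to.
const taxonAttr = "rcsb_entity_source_organism.taxonomy_lineage.id"

// terminalParams and terminalNode model one terminal node of a search query.
// The shape is an assumption based on the public RCSB Search API format.
type terminalParams struct {
	Attribute string   `json:"attribute"`
	Operator  string   `json:"operator"`
	Value     []string `json:"value"`
}

type terminalNode struct {
	Type       string         `json:"type"`
	Service    string         `json:"service"`
	Parameters terminalParams `json:"parameters"`
}

// taxonNode builds a string `in` filter: taxon ids are sent as strings.
func taxonNode(taxonIDs ...int) terminalNode {
	vals := make([]string, len(taxonIDs))
	for i, id := range taxonIDs {
		vals[i] = strconv.Itoa(id)
	}
	return terminalNode{
		Type:       "terminal",
		Service:    "text",
		Parameters: terminalParams{Attribute: taxonAttr, Operator: "in", Value: vals},
	}
}

// encodeTaxonNode renders the node as JSON for inspection.
func encodeTaxonNode(taxonIDs ...int) (string, error) {
	b, err := json.Marshal(taxonNode(taxonIDs...))
	return string(b), err
}

// capIDs trims the id set before any per-entry fetch, so a --limit 1 query
// fetches one entry instead of the full total. limit <= 0 means "no limit".
func capIDs(ids []string, limit int) []string {
	if limit > 0 && limit < len(ids) {
		return ids[:limit]
	}
	return ids
}
```

Capping the id list before the per-entry loop, rather than breaking out of the loop later, keeps the fetch count bounded by the limit regardless of how the loop is structured.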

clinvar

  • limit: honors Query.Limit as a walk ceiling on the esearch idlist.
  • multi-word terms: quotes Entrez values containing whitespace (e.g. "Uncertain significance"). Unquoted multi-word values were mis-tokenized by esearch and silently returned 0.
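
The quoting rule above can be sketched in a few lines of Go. The helper names `quoteEntrezValue` and `entrezTerm` are hypothetical, and the field tag `FIELD` is a placeholder rather than a real ClinVar field name; embedded double quotes inside a value are not handled here.

```go
package main

import "strings"

// quoteEntrezValue wraps a value in double quotes when it contains
// whitespace, so esearch treats it as one phrase instead of separate tokens.
// Values that are already quoted are returned unchanged.
func quoteEntrezValue(v string) string {
	v = strings.TrimSpace(v)
	alreadyQuoted := len(v) >= 2 && strings.HasPrefix(v, `"`) && strings.HasSuffix(v, `"`)
	if strings.ContainsAny(v, " \t\r\n") && !alreadyQuoted {
		return `"` + v + `"`
	}
	return v
}

// entrezTerm renders value[field] using the quoting rule above.
// The field argument is a placeholder for whatever tag the connector uses.
func entrezTerm(value, field string) string {
	return quoteEntrezValue(value) + "[" + field + "]"
}
```

With this rule, `entrezTerm("Uncertain significance", "FIELD")` yields `"Uncertain significance"[FIELD]`, while a single-word value such as `Pathogenic` is passed through without quotes.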

Verification

  • Per-connector regression tests added, each proven non-vacuous by revert.
  • Live counts confirmed against each source's own number (e.g. ClinVar BRCA1 Pathogenic = 14,186 matches esearch exactly; UniProt human reviewed = 20,431; BRCA1 VUS 0 → 7,877 after the quoting fix).
  • make ci green; govulncheck clean.
  • Adversarial verify pass on all three diffs (PASS; no false-Complete risk). Cross-model codex gate deferred (ChatGPT usage limit); it will run once the limit resets.

🤖 Generated with Claude Code

…uniprot, pdb, clinvar)

Three connectors ignored Query.Limit and walked/fetched the full result set even for a
count query (--limit 1), causing timeouts; each also had a connector-specific bug. Mirrors
the ncbi-virus limit-as-walk-ceiling pattern (#55); complete-or-fail is preserved (the strict
`limit < authority` guard makes a limited walk always BestEffort, never a false Complete).

uniprot:
- decompress gzip response bodies in readBody (Content-Encoding OR 0x1f8b magic sniff);
  UniProt gzips its JSON and it arrived undecompressed, failing json decode with '\x1f'.
- honor Query.Limit as a walk ceiling; expose the authoritative total on the BestEffort path.

pdb:
- map organism_taxon_id to rcsb_entity_source_organism.taxonomy_lineage.id (string `in`);
  the prior attribute is search-disabled upstream (HTTP 400 for every operator).
- honor Query.Limit as a walk ceiling; cap getEntry to the limited id set (it was fetching
  every entry of the full total even for --limit 1).

clinvar:
- honor Query.Limit as a walk ceiling on the esearch idlist walk.
- quote multi-word Entrez values ("Uncertain significance"); unquoted multi-word values were
  mis-tokenized by esearch and silently returned 0.

Per-connector regression tests added, each proven non-vacuous by revert. make ci green.
Cross-model (codex) gate deferred (ChatGPT usage limit); covered by an adversarial verify pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
001TMF added the spine-change label (authorizes edits to the frozen contract spine: engine/, idl/, schema; CONTRACTS.md A.2) Jun 14, 2026