berntpopp/gencc-link

GenCC-Link

MCP + FastAPI server that grounds gene-disease validity questions in the Gene Curation Coalition (GenCC) dataset — harmonized, aggregated, and served with consensus and conflict detection.

Research use only. Not for diagnosis, treatment, triage, patient management, or clinical decision support.

Features

  • GenCC gene-disease validity harmonized across member submitters (ClinGen, Genomics England PanelApp, Orphanet, Ambry, Invitae, Illumina, and others).
  • Strongest-classification + conflict detection — for each gene-disease pair, the strongest_classification (highest rank across submitters) and a has_conflict flag when supporting and against assertions coexist.
  • Local SQLite + FTS5 store built from the weekly GenCC bulk export — fast, deterministic, no upstream API at query time.
  • 12 MCP tools with token-efficient response_mode shaping, typed outputSchema, plain-English headlines, and ready-to-call _meta.next_commands chains (one per resolved entity) — on success and error envelopes, so recovery is deterministic.
  • Validated enum filters: find_curations rejects out-of-vocabulary classification/submitter/moi values with invalid_input and returns the accepted set (case-insensitive, with "did you mean"), instead of a misleading empty result. Each matched row carries a matched field naming the triggering submission.
  • Observability — every _meta carries request_id + elapsed_ms; get_gencc_diagnostics reports the daily download-quota headroom.
  • Three transports from one codebase: unified (REST + MCP), http, stdio.
  • Agent-discoverable: gencc:// capabilities (with inheritance_modes, data_notes), usage, reference, license, and citation resources; typed error envelopes; full recommended_citation in full mode, or a cacheable citation_ref + one-line citation_short otherwise.
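
The strongest-classification and conflict logic described in the feature list can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the rank values, the split into supporting and against terms, and the names Submission and summarize are assumptions made for the example. The server reports its real classification ranks through get_server_capabilities.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed rank order (higher = stronger support). Illustrative only.
SUPPORTING_RANK = {
    "Definitive": 5,
    "Strong": 4,
    "Moderate": 3,
    "Supportive": 2,
    "Limited": 1,
}
# Assumed set of classifications that argue against a gene-disease relationship.
AGAINST = {"Disputed Evidence", "Refuted Evidence", "No Known Disease Relationship"}


@dataclass(frozen=True)
class Submission:
    """One submitter's assertion for a gene-disease pair (hypothetical shape)."""

    submitter: str
    classification: str


def summarize(submissions: list[Submission]) -> dict[str, object]:
    """Return the strongest supporting classification and a conflict flag."""
    supporting = [s for s in submissions if s.classification in SUPPORTING_RANK]
    against = [s for s in submissions if s.classification in AGAINST]
    strongest: Optional[Submission] = max(
        supporting, key=lambda s: SUPPORTING_RANK[s.classification], default=None
    )
    return {
        "strongest_classification": strongest.classification if strongest else None,
        # Conflict: at least one supporting and one against assertion coexist.
        "has_conflict": bool(supporting) and bool(against),
    }
```

In this sketch a pair with only against assertions has no strongest classification; how the real server handles that case is not stated here.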

Data source & license

GenCC has no live API; data is distributed as a single bulk export.

  • Source: Gene Curation Coalition bulk submissions export (new format) from thegencc.org, ~24MB TSV, updated weekly.
  • Data license: CC0 1.0 (public domain). Attribution to GenCC and the contributing sources is requested.
  • OMIM restriction: OMIM disease text is restricted where licensing forbids, so the disease_original_* OMIM fields may be absent — this is expected.
  • Not clinical: GenCC data is not intended for direct diagnostic use or medical decision-making without review by a genetics professional.

Quick start

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install project and dev dependencies
uv sync

# Build the local SQLite database from the GenCC export (~24MB download)
make data

# Start the unified REST + MCP server on http://127.0.0.1:8000
make dev

# Or start the local stdio MCP server (for Claude Desktop)
make mcp-serve

The database is built into <repo>/data/gencc.sqlite by default. With GENCC_LINK_DATA__AUTO_BOOTSTRAP=true (the default), the HTTP / unified server also builds the database on first use if it is absent, so make data is optional but recommended for a predictable first boot.

Database management commands:

make data          # gencc-link-data build   — force download + rebuild
make data-refresh  # gencc-link-data refresh — rebuild only if export changed
make data-info     # gencc-link-data info    — print build provenance

Connecting Claude Code & Claude Desktop

See docs/MCP_CONNECTION_GUIDE.md for the full guide. Streamable HTTP at /mcp is recommended; stdio is a local fallback.

Claude Code (HTTP)

make dev
claude mcp add --transport http gencc-link http://127.0.0.1:8000/mcp

Claude Desktop (HTTP)

{
  "mcpServers": {
    "gencc-link": {
      "type": "http",
      "url": "http://127.0.0.1:8000/mcp"
    }
  }
}

Claude Desktop (stdio)

{
  "mcpServers": {
    "gencc-link": {
      "command": "gencc-link-mcp",
      "env": {
        "PYTHONUNBUFFERED": "1",
        "GENCC_LINK_LOG_LEVEL": "WARNING"
      }
    }
  }
}

Or run stdio from a checkout with uv (no install step):

{
  "mcpServers": {
    "gencc-link": {
      "command": "uv",
      "args": ["run", "python", "mcp_server.py"],
      "cwd": "/absolute/path/to/gencc-link"
    }
  }
}

Available MCP tools

Tool                        Purpose
get_server_capabilities     Tool inventory, classification ranks, response modes, data freshness
get_gencc_diagnostics       Build provenance + row/gene/disease/submitter counts
search_genes                Resolve symbol / HGNC id / partial text to genes (FTS)
search_diseases             Resolve title / MONDO / OMIM id to diseases (FTS)
get_gene_curations          All gene-disease assertions for a gene, with strongest classification + conflict
get_disease_curations       All genes asserted for a disease, with strongest classification + conflict
get_genes_curations         Batch get_gene_curations: up to 20 genes in one call (misses in unresolved)
get_diseases_curations      Batch get_disease_curations: up to 20 diseases in one call (misses in unresolved)
get_gene_disease_assertion  One pair: per-submitter classifications, MOI, PMIDs, URLs + conflict analysis
find_curations              Filter assertions by classification/submitter/MOI/conflict (ids_only for cheap paging; cursor for refresh-safe autonomous page-forward)
list_submitters             Submitting organizations + counts
resolve_identifier          Map free text to canonical HGNC/MONDO ids

Tools whose payloads vary accept response_mode: minimal | compact (default) | standard | full. See docs/usage.md for the canonical workflows and the citation contract.
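
The enum validation that find_curations is described as performing (case-insensitive matching, an invalid_input error carrying the accepted set, and a "did you mean" hint) might look something like the sketch below. The vocabulary, the envelope keys, and the function name validate_enum are hypothetical. The standard-library difflib stands in for whatever suggestion logic the server actually uses.

```python
import difflib

# Hypothetical accepted vocabulary for illustration; the server's real set may differ.
ACCEPTED_CLASSIFICATIONS = ["Definitive", "Strong", "Moderate", "Supportive", "Limited"]


def validate_enum(value: str, accepted: list[str], field: str) -> dict[str, object]:
    """Return the canonical value, or an invalid_input envelope with a suggestion."""
    by_lower = {item.lower(): item for item in accepted}
    needle = value.strip().lower()
    canonical = by_lower.get(needle)
    if canonical is not None:
        return {"ok": True, "value": canonical}
    # Closest accepted value by similarity ratio, if any is close enough.
    close = difflib.get_close_matches(needle, list(by_lower), n=1, cutoff=0.6)
    return {
        "ok": False,
        "error": "invalid_input",
        "field": field,
        "accepted": accepted,
        "did_you_mean": by_lower[close[0]] if close else None,
    }
```

Returning an explicit error with the accepted set lets a caller correct a typo in one step, rather than interpreting an empty result as "no curations exist".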

GeneFoundry federation

GenCC-Link is part of the GeneFoundry *-link MCP fleet, federated behind the genefoundry-router gateway. It follows Tool-Naming & Normalization Standard v1:

  • serverInfo.name: gencc-link (stable identity, set on the FastMCP instance).
  • Gateway namespace token: gencc. The router mounts this server with namespace="gencc", so its tools surface at the gateway as gencc_<tool> (e.g. gencc_search_genes). Standalone MCP clients namespace it as mcp__gencc-link__<tool>.
  • Unprefixed leaves: tool names are intentionally not server-prefixed. Namespacing is the gateway's job (Rule 1), so a leaf prefix would double-prefix at the gateway. A CI guard (tests/test_tool_naming.py) enforces ^[a-z0-9_]{1,50}$ plus a canonical verb (get/search/list/resolve/find/compare/compute) plus a domain tag on every registered tool.
  • Canonical arguments: gene_symbol (approved symbol) / hgnc_id (HGNC CURIE) — pass exactly one to a single-gene tool; disease (MONDO/OMIM CURIE or title); response_mode; limit/offset. The batch get_genes_curations keeps a polymorphic genes list (symbols or HGNC CURIEs).
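
A naming guard along the lines of the CI check mentioned above could be sketched as follows. The regex and verb list come from the rule as stated. Treating the "domain tag" as any non-empty remainder after the verb is an assumption, and is_valid_tool_name is a hypothetical helper, not the contents of tests/test_tool_naming.py.

```python
import re

NAME_PATTERN = re.compile(r"^[a-z0-9_]{1,50}$")
CANONICAL_VERBS = ("get", "search", "list", "resolve", "find", "compare", "compute")


def is_valid_tool_name(name: str) -> bool:
    """Check the character/length pattern and a leading canonical verb."""
    if not NAME_PATTERN.fullmatch(name):
        return False
    verb, _, rest = name.partition("_")
    # Assumption: the domain tag is whatever follows the verb and must be non-empty.
    return verb in CANONICAL_VERBS and bool(rest)
```

Under this sketch search_genes passes, while a server-prefixed leaf such as gencc_search_genes fails because its first segment is not a canonical verb, which matches the intent of keeping leaves unprefixed.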

Architecture

GenCC is small, slow-changing bulk data with no live API, so GenCC-Link builds a local SQLite + FTS5 artifact once and queries it in-process — no upstream client, rate limiting, or caching against an external API at query time.

ingest (download -> parse -> aggregate -> build) -> SQLite + FTS5 store
  -> repository (read-only) -> service (search / curations / consensus)
  -> MCP tools  +  FastAPI (/health, /, /docs)
  -> transports: unified | http | stdio

Full details, the consensus/conflict model, and an ASCII diagram are in docs/architecture.md.
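
To make the "local SQLite + FTS5, queried in-process" idea concrete, here is a small self-contained sketch. The table name gene_fts, its columns, and the search_genes helper are hypothetical and do not reflect GenCC-Link's actual schema. It also assumes the Python build's SQLite library was compiled with FTS5, which is common but not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema for illustration; hgnc_id is stored but not full-text indexed.
conn.execute("CREATE VIRTUAL TABLE gene_fts USING fts5(symbol, hgnc_id UNINDEXED)")
conn.executemany(
    "INSERT INTO gene_fts (symbol, hgnc_id) VALUES (?, ?)",
    [("BRCA1", "HGNC:1100"), ("BRCA2", "HGNC:1101"), ("TP53", "HGNC:11998")],
)


def search_genes(prefix: str, limit: int = 10) -> list[tuple[str, str]]:
    """Prefix search over gene symbols, ordered by symbol for deterministic output."""
    # Quote the term so user input is treated as a string, then append the
    # FTS5 prefix operator (*).
    query = '"' + prefix.replace('"', '""') + '"*'
    rows = conn.execute(
        "SELECT symbol, hgnc_id FROM gene_fts WHERE gene_fts MATCH ? "
        "ORDER BY symbol LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Calling search_genes("BRCA") here matches both BRCA symbols because the default tokenizer folds case and the trailing * requests a prefix match. No network access is involved at query time, which is the property the architecture relies on.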

Configuration

Settings load from environment variables prefixed GENCC_LINK_ (nested data config uses a double underscore) and an optional .env file. Copy .env.example and adjust. Key variables:

Variable                                 Default              Description
GENCC_LINK_HOST                          127.0.0.1            Server host
GENCC_LINK_PORT                          8000                 Server port
GENCC_LINK_TRANSPORT                     unified              unified | http | stdio
GENCC_LINK_MCP_PATH                      /mcp                 MCP endpoint path
GENCC_LINK_LOG_LEVEL                     INFO                 Logging level
GENCC_LINK_LOG_FORMAT                    console              console or json
GENCC_LINK_DATA__SOURCE_FORMAT           new                  GenCC export format (new | legacy)
GENCC_LINK_DATA__DATA_DIR                <repo>/data          Directory for the built database
GENCC_LINK_DATA__DB_FILENAME             gencc.sqlite         SQLite filename in the data dir
GENCC_LINK_DATA__AUTO_BOOTSTRAP          true (image: false)  Build the database lazily on first use if absent
GENCC_LINK_DATA__REFRESH_ENABLED         true                 Run the in-app conditional-refresh scheduler (unified/http only)
GENCC_LINK_DATA__REFRESH_INTERVAL_HOURS  24                   Hours between conditional refresh checks
GENCC_LINK_DATA__REFRESH_JITTER_SECONDS  300                  Random jitter (seconds) added to each refresh
GENCC_LINK_DATA__BUILD_LOCK_TIMEOUT      600                  Seconds to wait for the cross-process build lock
GENCC_LINK_DATA__DOWNLOAD_TIMEOUT        120                  Download timeout (seconds)
GENCC_LINK_DATA__CACHE_SIZE              512                  Query cache entries (0 disables)
GENCC_LINK_DATA__CACHE_TTL               3600                 Query cache TTL (seconds)

See docs/data-lifecycle.md for how the database is built on startup and refreshed on a schedule (in-app scheduler, cron sidecar, or Kubernetes CronJob).
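
The naming convention for settings (a GENCC_LINK_ prefix, with a double underscore separating nested data options) can be illustrated with a stdlib-only sketch. This is not the project's settings loader, whose implementation is not described here. The function name load_settings and the lower-cased key layout are assumptions, and values are left as raw strings.

```python
from typing import Any

PREFIX = "GENCC_LINK_"
NESTED_DELIMITER = "__"


def load_settings(environ: dict[str, str]) -> dict[str, Any]:
    """Collect GENCC_LINK_* variables into a nested dict with lower-case keys."""
    settings: dict[str, Any] = {}
    for key, value in environ.items():
        if not key.startswith(PREFIX):
            continue  # ignore unrelated environment variables
        path = key[len(PREFIX):].lower().split(NESTED_DELIMITER)
        node = settings
        for part in path[:-1]:
            # Descend into (or create) the nested section, e.g. "data".
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return settings
```

For example, passing dict(os.environ) with GENCC_LINK_DATA__CACHE_SIZE=0 set would yield a data section containing cache_size, alongside top-level keys such as port.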

Development

make install      # install project + dev dependencies (uv sync --group dev)
make ci-local     # format-check, lint, file-size budget, typecheck, fast tests
make test         # run tests (excludes integration)
make test-cov     # run tests with coverage (gate: 85%)
make lint         # ruff lint
make lint-loc     # enforce the per-file line budget (scripts/check_file_size.py)
make typecheck    # mypy strict

make ci-local is the gate to run before every commit. The project uses uv, Ruff (100 cols), mypy strict, and a per-file line budget enforced by scripts/check_file_size.py. Integration tests (-m integration) hit the live GenCC download endpoint and are excluded from the default runs. Agentic coding tools should follow AGENTS.md; Claude Code also loads the lean CLAUDE.md.

Docker deployment

make docker-build           # build the image
make docker-up              # start the unified server on host port 8000
curl http://localhost:8000/health
make docker-logs
make docker-down

The container's entrypoint builds the database once on startup (before the server accepts traffic), and an in-app scheduler conditionally refreshes it every 24h and hot-reloads the running server — so first-request latency is predictable and the daily download quota is respected. The built ~24MB database lives in the gencc-data named volume at /app/data and persists across restarts (a restart re-uses it; the conditional request returns 304).
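
The conditional refresh that ends in a 304 on an unchanged export follows the general HTTP conditional-request pattern, sketched below with the standard library. Whether GenCC-Link uses ETag, Last-Modified, or both is not stated here, so both validators are shown as assumptions. The helper names are hypothetical, and no real download URL is implied.

```python
import urllib.request
from typing import Optional


def build_conditional_request(
    url: str, etag: Optional[str], last_modified: Optional[str]
) -> urllib.request.Request:
    """Attach stored validators so the server can answer 304 Not Modified."""
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request


def needs_rebuild(status: int) -> bool:
    """A 304 means the stored export is still current; anything else triggers a rebuild."""
    return status != 304
```

Note that urllib.request.urlopen reports a 304 by raising urllib.error.HTTPError with code 304, so a caller using this sketch would catch that exception and pass its code to needs_rebuild.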

For a dedicated scheduler instead of the in-app loop, use the cron sidecar overlay:

docker compose -f docker/docker-compose.yml -f docker/docker-compose.cron.yml up -d

Kubernetes manifests (initContainer + in-app scheduler, or an external CronJob) are in deploy/k8s/. The full strategy and all scheduling options are documented in docs/data-lifecycle.md. See docker/README.md for the production overlay.

License & citation

  • Code: MIT — see LICENSE.
  • Data: CC0 1.0 (public domain), from GenCC (thegencc.org); attribution requested.

Cite GenCC as:

DiStefano MT, et al. The Gene Curation Coalition. Genet Med. 2022;24(8):1732-1742. doi:10.1016/j.gim.2022.04.017

Acknowledgments


Research use only. GenCC-Link is a research tool and must not be used for diagnosis, treatment, triage, patient management, or clinical decision support. GenCC data is not intended for direct diagnostic use or medical decision-making without review by a genetics professional.
