LLM PMID Checker

A system for checking whether research triples are supported by PubMed abstracts using large language models.

Overview

Given a Parquet (or TSV) file of research triples (e.g., aspirin biolink:treats_or_applied_or_studied_to_treat headache) with associated PubMed IDs, this system:

Extracts abstracts from PMIDs via NCBI E-utilities
Evaluates support using vLLM-served models with concurrent batch processing
Saves results to a SQLite database (recommended) or TSV file, preserving all input columns alongside evaluation outputs

Pre-computed Evaluation Results

Pre-computed evaluation results for the full SemMedDB KGX dataset are available as a GitHub release. If you only need the results, you can skip the setup steps below and download them directly.

Download Results

To download the pre-computed results, please use the provided download script (requires only Python stdlib):

python scripts/download_release_data.py --output-dir results --tag v1.0

After running the command above, the following files will be downloaded:

results/
    results_no_abstract.parquet
    results.parquet                           <-- use this (3.5G, 26,719,183 rows with 11 cols)
    LLM_Pmid_Evaluation_SemMedDB_v1.0.tar.gz

Explanation of Results

The release contains two Parquet files:

File	Size	Rows	Description
`results.parquet`	~3.5 GB	26,719,183	Evaluation results for triples whose PMIDs had an available abstract
`results_no_abstract.parquet`	~41 MB	1,292,499	Triples whose PMIDs had no abstract available (not evaluated)

results.parquet contains the original input columns plus LLM evaluation outputs:

Column	Type	Description
`subject_curie`	string	Subject entity CURIE (e.g., `NCBITaxon:562`)
`predicate`	string	Biolink predicate (e.g., `biolink:has_part`)
`object_curie`	string	Object entity CURIE (e.g., `NCBIGene:100`)
`PMID`	string	PubMed ID used for evaluation (e.g., `PMID:3047400`)
`SemMedDB_sentences`	string	Original SemMedDB sentence(s) for this edge (pipe-separated if multiple)
`predicted`	bool	Whether the triple is supported (`True` if `support == "yes"`)
`support`	string	LLM judgment: `yes`, `no`, or `maybe`
`subject_mentioned`	bool	Whether the subject entity is mentioned in the abstract
`object_mentioned`	bool	Whether the object entity is mentioned in the abstract
`supporting_sentences`	string	Exact sentences from the abstract that support the triple (pipe-separated if multiple)
`reasoning`	string	LLM's reasoning for the judgment

Each row represents one unique (subject_curie, predicate, object_curie, PMID) combination. The LLM reads the PubMed abstract for the given PMID, checks whether the subject and object are mentioned, and judges whether the abstract supports the stated relationship.

Dataset Statistics

Coverage:

Metric	Count
Total unique (subject, predicate, object, PMID) combinations	28,011,682
Triples with abstract available (evaluated)	26,719,183 (95.4%)
Triples without abstract (not evaluated)	1,292,499 (4.6%)
Unique PMIDs (with abstract)	12,000,111
Unique PMIDs (without abstract)	1,094,551
Unique subject CURIEs	54,920
Unique object CURIEs	47,290
Unique Biolink predicates	19

Support distribution (among 26,719,183 evaluated triples):

Support	Count	Percentage
`yes`	16,139,271	60.4%
`no`	9,447,319	35.4%
`maybe`	1,132,593	4.2%

Entity mention rates (among 26,719,183 evaluated triples):

Metric	Count	Percentage
Both subject and object mentioned	20,659,114	77.3%
Subject only mentioned	2,427,312	9.1%
Object only mentioned	2,909,116	10.9%
Neither mentioned	723,641	2.7%

How to Run

1. Install Dependencies

conda activate llm_pmid_env
pip install -r requirements.txt

2. Prepare SemMedDB KGX Data

Download the SemMedDB KGX dataset from Translator-CATRAX/SemMedDB-KGX into data/semmedb_kgx/:

cd data/semmedb_kgx
python download_semmeddb_uncapped.py

Then extract the per-PMID edges into a Parquet file from the normalized edges JSONL:

python scripts/extract_semmeddb_edges.py \
    -i data/semmedb_kgx/normalized_edges.jsonl \
    -o data/semmedb_kgx/semmeddb_edges_extracted.parquet

This groups edges by (subject_curie, predicate, object_curie, PMID). When multiple edge records share the same key but have different supporting sentences, the sentences are concatenated with " | " in the SemMedDB_sentences column. Output columns: subject_curie, predicate, object_curie, PMID, SemMedDB_sentences.

3. Node File & CURIE Names for Richer Entity Context (Recommended)

The SemMedDB KGX download (step 2) includes normalized_nodes.jsonl, which provides entity names, categories, descriptions, and equivalent identifiers for every CURIE. Combined with curie_all_names.tsv (generated via the following command), they can provide richer entity context.

python scripts/extract_curie_names.py \
    --input data/semmedb_kgx/semmeddb_edges_extracted.parquet \
    --output data/semmedb_kgx/curie_all_names.tsv \
    --batch-size 500 --max-concurrent 10

This queries the Node Normalization API for all unique CURIEs and collects every known name variant (primary label + labels from equivalent identifiers), case-insensitively deduplicated.

4. Start vLLM Server(s)

Use the provided setup script to launch one or more vLLM servers:

# GPT-OSS 20B on GPU 0
VLLM_MODEL=openai/gpt-oss-20b VLLM_MODEL_NAME=gpt-oss-20b-vllm VLLM_GPU=0 VLLM_PORT=8000 bash setup_vllm.sh

# GPT-OSS 120B on GPU 1
VLLM_MODEL=openai/gpt-oss-120b VLLM_MODEL_NAME=gpt-oss-120b-vllm VLLM_GPU=1 VLLM_PORT=9000 bash setup_vllm.sh

5. Extract Biolink Predicate Definitions (Optional)

If your input uses Biolink predicates (e.g., biolink:affects, biolink:treats_or_applied_or_studied_to_treat), extract predicate definitions from the Biolink Model YAML to provide the LLM with formal predicate semantics:

python scripts/extract_biolink_predicates.py \
    --input data/biolink_data/biolink-model.yaml \
    --output data/biolink_data/biolink_predicates.tsv

The output TSV has two columns: predicate (e.g., biolink:affects) and description. Pass it to main.py via --predicate_file.

6. Pre-fetch PMID Abstracts (Recommended)

Abstracts fetched from NCBI are automatically cached in a local SQLite database (data/pmid_cache.db). For large datasets, pre-fetch all abstracts before running evaluation to avoid rate limits during batch processing:

python scripts/prefetch_pmid_abstracts.py \
    --tsv-file data/semmedb_kgx/semmeddb_edges_extracted.parquet \
    --batch-size 200 --delay 1.0

# Force re-fetch (overwrite cached entries)
python scripts/prefetch_pmid_abstracts.py \
    --tsv-file data/semmedb_kgx/semmeddb_edges_extracted.parquet --force

If the initial fetch has transient network failures, retry only the failed PMIDs:

python scripts/retry_failed_pmids.py --batch-size 200 --delay 1.0

To diagnose cache issues:

# Find PMIDs that failed to cache (errors or missing abstracts)
python scripts/check_failed_pmids.py \
    --tsv-file data/semmedb_kgx/semmeddb_edges_extracted.parquet

# Check overall cache status
python scripts/check_cache_status.py

7. Configure Environment

Create a .env file in the project root:

# NCBI E-utilities
NCBI_EMAIL=your.email@example.com
NCBI_API_KEY=your_ncbi_api_key_here

# Batch processing
MAX_CONCURRENT_REQUESTS=5

# vLLM Configuration
VLLM_BASE_URL=http://localhost:8000

# Per-model URLs (comma-separated model=url pairs)
VLLM_MODEL_URLS=gpt-oss-20b-vllm=http://localhost:8000,gpt-oss-120b-vllm=http://localhost:8002

# Available vLLM models (must match --served-model-name used when starting vLLM)
AVAILABLE_VLLM_MODELS=gpt-oss-20b-vllm,gpt-oss-120b-vllm

8. Run Evaluation

See Usage below for full command-line options and examples.

Usage

python main.py --input INPUT_FILE --output OUTPUT_FILE [options]

Input format is auto-detected from the file extension:

.parquet / .pq → Parquet (recommended, preserves text exactly)
.tsv / .txt → Tab-separated values

Output format is auto-detected from the file extension:

.db / .sqlite / .sqlite3 → SQLite database (recommended for stop/resume)
.tsv / .txt → Tab-separated values

When using SQLite output (.db), rows without a cached abstract are written to a evaluations_no_abstract table in the same database. When using TSV output, they are written to a separate *_no_abstract.tsv file.

Flag	Description
`--input`	(required) Input file (`.parquet` or `.tsv`; must contain `subject_curie`, `predicate`, `object_curie`, `PMID`)
`--output`	(required) Output file (`.db` for SQLite recommended, `.tsv` for TSV)
`--val_model`	Validation model (default: first in `AVAILABLE_VLLM_MODELS`)
`--round2_model`	Optional Round 2 model for re-evaluating yes/maybe results
`--table`	SQLite table name, only for `.db` output (default: `evaluations`)
`--node_dict`	Nodes file (`.jsonl`, `.jsonl.gz`) for richer entity context
`--names_file`	`curie_all_names.tsv` to supplement `--node_dict` with richer equivalent names
`--predicate_file`	Biolink predicates TSV with predicate definitions (columns: `predicate`, `description`)
`--max_concurrent`	Max concurrent requests (default: `MAX_CONCURRENT_REQUESTS` from `.env`)
`--overwrite`	Discard existing output and start fresh (default: auto-resume)
`--verbose` / `-v`	Enable DEBUG logging

Stop & Resume

Results are written incrementally -- every completed row is flushed to disk immediately. You can safely Ctrl+C at any time and re-run the exact same command to resume:

# First run (or resume after interruption) -- same command each time
python main.py --input data/semmedb_kgx/semmeddb_edges_extracted.parquet --output results.db \
    --val_model gpt-oss-120b-vllm \
    --predicate_file data/biolink_data/biolink_predicates.tsv \
    --node_dict data/semmedb_kgx/normalized_nodes.jsonl \
    --names_file data/semmedb_kgx/curie_all_names.tsv

# To discard previous progress and start over
python main.py --input data/semmedb_kgx/semmeddb_edges_extracted.parquet --output results.db \
    --val_model gpt-oss-120b-vllm --overwrite \
    --predicate_file data/biolink_data/biolink_predicates.tsv \
    --node_dict data/semmedb_kgx/normalized_nodes.jsonl \
    --names_file data/semmedb_kgx/curie_all_names.tsv

On resume, the program reads the existing output, determines which (subject_curie, predicate, object_curie, PMID) rows are already evaluated, and only processes the remaining rows.

Examples

# Standard evaluation with Parquet input and SQLite output (recommended)
python main.py --input data/semmedb_kgx/semmeddb_edges_extracted.parquet --output results.db \
    --val_model gpt-oss-120b-vllm --max_concurrent 24 \
    --predicate_file data/biolink_data/biolink_predicates.tsv \
    --node_dict data/semmedb_kgx/normalized_nodes.jsonl \
    --names_file data/semmedb_kgx/curie_all_names.tsv

# Two-round evaluation (Round 1 with 20B, Round 2 with 120B)
python main.py --input data/semmedb_kgx/semmeddb_edges_extracted.parquet --output results.db \
    --val_model gpt-oss-20b-vllm --round2_model gpt-oss-120b-vllm \
    --predicate_file data/biolink_data/biolink_predicates.tsv \
    --node_dict data/semmedb_kgx/normalized_nodes.jsonl \
    --names_file data/semmedb_kgx/curie_all_names.tsv

# Write to a custom table name (useful for multiple runs in the same DB)
python main.py --input data/semmedb_kgx/semmeddb_edges_extracted.parquet --output results.db \
    --val_model gpt-oss-20b-vllm --table run_20b_v1 \
    --predicate_file data/biolink_data/biolink_predicates.tsv \
    --node_dict data/semmedb_kgx/normalized_nodes.jsonl \
    --names_file data/semmedb_kgx/curie_all_names.tsv

Input Format

The input file (Parquet or TSV) must contain these columns:

Column	Description
`subject_curie`	Subject entity CURIE (e.g., `CHEBI:70723`)
`predicate`	Relationship (e.g., `biolink:affects`, `biolink:treats_or_applied_or_studied_to_treat`)
`object_curie`	Object entity CURIE (e.g., `PR:000004517`)
`PMID`	PubMed ID to check against

Any additional columns are carried through to the output unchanged.

Output Format

Results are written to a SQLite database (.db, recommended) or a TSV file (.tsv), depending on the --output extension. Both formats contain all columns from the input plus these evaluation columns:

Column	Type	Description
`predicted`	bool	Whether the triple is supported (`support == "yes"`)
`support`	text	`yes`, `no`, or `maybe`
`subject_mentioned`	bool	Whether the subject appears in the abstract
`object_mentioned`	bool	Whether the object appears in the abstract
`supporting_sentences`	text	Exact sentences from the abstract (pipe-separated with `" \| "`)
`reasoning`	text	LLM's reasoning for the judgment

Post-evaluation Utilities

Convert SQLite to Parquet (for final delivery or analytical queries):

python scripts/convert_db_to_parquet.py --db results.db --output-dir .

This produces results.parquet (from the evaluations table) and results_no_abstract.parquet (from the evaluations_no_abstract table). The runtime_seconds column is dropped by default; use --drop-columns with no arguments to keep all columns.

Verify coverage (ensure all input rows are accounted for):

python scripts/compare_coverage.py \
    --extracted data/semmedb_kgx/semmeddb_edges_extracted.parquet \
    --results-db results.db

This reports unique 4-key counts, duplicates, overlap between tables, coverage percentage, and lists any missing or extra keys.

Available Models

Model	HuggingFace Repo
`gpt-oss-20b-vllm`	openai/gpt-oss-20b
`gpt-oss-120b-vllm`	openai/gpt-oss-120b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM PMID Checker

Overview

Pre-computed Evaluation Results

Download Results

Explanation of Results

Dataset Statistics

How to Run

1. Install Dependencies

2. Prepare SemMedDB KGX Data

3. Node File & CURIE Names for Richer Entity Context (Recommended)

4. Start vLLM Server(s)

5. Extract Biolink Predicate Definitions (Optional)

6. Pre-fetch PMID Abstracts (Recommended)

7. Configure Environment

8. Run Evaluation

Usage

Stop & Resume

Examples

Input Format

Output Format

Post-evaluation Utilities

Available Models

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
data		data
evaluation		evaluation
preliminary_version		preliminary_version
scripts		scripts
src		src
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt
setup_vllm.sh		setup_vllm.sh

Folders and files

Latest commit

History

Repository files navigation

LLM PMID Checker

Overview

Pre-computed Evaluation Results

Download Results

Explanation of Results

Dataset Statistics

How to Run

1. Install Dependencies

2. Prepare SemMedDB KGX Data

3. Node File & CURIE Names for Richer Entity Context (Recommended)

4. Start vLLM Server(s)

5. Extract Biolink Predicate Definitions (Optional)

6. Pre-fetch PMID Abstracts (Recommended)

7. Configure Environment

8. Run Evaluation

Usage

Stop & Resume

Examples

Input Format

Output Format

Post-evaluation Utilities

Available Models

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages