iKPUB

Automatic publication classifier for W. M. Keck Observatory.

Setup

Install in development mode:

pip install -e .

Create .env at the project root:

MONGO_SERVER=hostname
MONGO_PORT=27017
MONGO_USER=username
MONGO_PWD=password
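
For illustration, the four variables above are enough to assemble a MongoDB connection string. The sketch below is not the project's own loader: parse_env and mongo_uri are hypothetical helper names, and it assumes plain username/password authentication over the standard mongodb:// scheme.

```python
from urllib.parse import quote_plus


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def mongo_uri(env):
    """Build a mongodb:// URI (hypothetical helper, not part of iKPUB)."""
    # Percent-encode credentials so characters like '@' or '/' are safe.
    user = quote_plus(env["MONGO_USER"])
    pwd = quote_plus(env["MONGO_PWD"])
    host = env["MONGO_SERVER"]
    port = int(env["MONGO_PORT"])
    return f"mongodb://{user}:{pwd}@{host}:{port}/"


# Same keys as the .env above; the password is a made-up example.
example = """MONGO_SERVER=hostname
MONGO_PORT=27017
MONGO_USER=username
MONGO_PWD=p@ss/word
"""
uri = mongo_uri(parse_env(example))
print(uri)
```

Encoding the credentials matters because an unescaped @ or / in the password would otherwise be read as part of the URI structure.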

Copy the tracked config templates to their local (gitignored) counterparts, then edit as needed:

cp config/models.default.yaml config/models.yaml
cp config/article_subset.default.yaml config/article_subset.yaml

Articles live in MongoDB. Full text stays on the filesystem (data/pubs/full_text/). Predictions are written directly back to MongoDB.

1. Fetch Full Text

Reads bibcodes and links_data from MongoDB, downloads the PDFs, and extracts their text to data/pubs/full_text/{year}/{bibcode}.txt.

python src/data/fetch_full_text.py --collection test_articles --year 2024
python src/data/fetch_full_text.py --collection test_articles --start-year 2020 --end-year 2025
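
The output layout described above (data/pubs/full_text/{year}/{bibcode}.txt) can be sketched as a small path helper. This is a minimal illustration, not the code in fetch_full_text.py: full_text_path and read_full_text are hypothetical names, the bibcode is a placeholder, and returning None for a missing file is an assumption.

```python
from pathlib import Path

# Root directory named in this README.
FULL_TEXT_ROOT = Path("data/pubs/full_text")


def full_text_path(bibcode, year, root=FULL_TEXT_ROOT):
    """Return {root}/{year}/{bibcode}.txt (hypothetical helper)."""
    return root / str(year) / f"{bibcode}.txt"


def read_full_text(bibcode, year, root=FULL_TEXT_ROOT):
    """Return the extracted text, or None if it has not been fetched."""
    path = full_text_path(bibcode, year, root)
    if not path.is_file():
        return None
    # Extracted PDF text can contain odd bytes; replace rather than fail.
    return path.read_text(encoding="utf-8", errors="replace")


# Placeholder bibcode, for illustration only.
path = full_text_path("2024ApJ...960....1X", 2024)
print(path.as_posix())
```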

2. Train / Test

Loads articles from MongoDB, derives keck_manual from the affiliation field ("keck" → positive, everything else → negative), merges full text from the filesystem, and runs the standard train/test split.

python src/scripts/train.py transformer --year 2000-2023 --save
python src/scripts/train.py embedding --collection test_articles
python src/scripts/train.py transformer --no-test --save  # train on all labeled data

Documents with no affiliation set are skipped and reported in the run summary.
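
The labeling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the logic in train.py: the function names are hypothetical, labels are encoded as 1/0, the "keck" match is treated as case-insensitive, and an empty string counts as "not set".

```python
def derive_keck_manual(doc):
    """Return 1 for "keck", 0 for any other affiliation, None if unset."""
    affiliation = doc.get("affiliation")
    if affiliation is None or str(affiliation).strip() == "":
        return None  # no affiliation: caller should skip this document
    # Assumption: comparison ignores case and surrounding whitespace.
    return 1 if str(affiliation).strip().lower() == "keck" else 0


def label_documents(docs):
    """Attach keck_manual to each labeled doc and count the skipped ones."""
    labeled, skipped = [], 0
    for doc in docs:
        label = derive_keck_manual(doc)
        if label is None:
            skipped += 1
            continue
        labeled.append({**doc, "keck_manual": label})
    return labeled, skipped


# Made-up documents; only the affiliation field matters here.
docs = [
    {"bibcode": "a", "affiliation": "keck"},
    {"bibcode": "b", "affiliation": "other"},
    {"bibcode": "c"},
]
labeled, skipped = label_documents(docs)
print([d["keck_manual"] for d in labeled], "skipped:", skipped)
```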

Available models: transformer, embedding, snippet (rule-based), llm. Hyperparameters are set in config/models.yaml.

Fine-tuning workflow

Train a base model once on the full reviewed history, then fine-tune on newly reviewed years as they arrive. The reviewed-subset filter lives in config/article_subset.yaml (copied from config/article_subset.default.yaml; currently: 2020–2024 excluding from_broad_query=true; 2025+ included wholesale).

# 1. Base model — full reviewed history
python src/scripts/train.py transformer --year 2000-2025 --collection articles --save

# 2. Fine-tune once 2025 data is reviewed
python src/scripts/train.py transformer --year 2020-2025 --collection articles \
    --save --finetune [BASE MODEL] --subset-articles

When 2026 data is reviewed, extend the year range (e.g. --year 2020-2026) and update config/article_subset.yaml to match, then fine-tune again.
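
As a rough sketch of the reviewed-subset rule quoted above (2020–2024 excluding from_broad_query=true; 2025+ included wholesale), the predicate and the equivalent MongoDB filter below show one way to express it. The real rule lives in config/article_subset.yaml, whose schema is not shown here. The field name year, the boolean storage of from_broad_query, and the exclusion of anything before 2020 are all assumptions.

```python
def in_reviewed_subset(doc):
    """Return True if a document falls inside the reviewed subset."""
    year = int(doc["year"])  # assumption: a "year" field exists
    if year >= 2025:
        return True  # 2025+ included wholesale
    if 2020 <= year <= 2024:
        # Drop documents that came from the broad query.
        return not doc.get("from_broad_query", False)
    return False  # assumption: earlier years are outside the subset


# The same rule as a MongoDB filter document (assumes numeric years).
SUBSET_QUERY = {
    "$or": [
        {"year": {"$gte": 2025}},
        {
            "year": {"$gte": 2020, "$lte": 2024},
            "from_broad_query": {"$ne": True},
        },
    ]
}

# Made-up documents for illustration.
samples = [
    {"year": 2022, "from_broad_query": True},
    {"year": 2022},
    {"year": 2025, "from_broad_query": True},
    {"year": 2019},
]
kept = [in_reviewed_subset(d) for d in samples]
print(kept)
```

Extending the subset for a newly reviewed year then amounts to moving the 2024/2025 boundary, which is why the year range on the command line and the config file need to change together.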

To seed a fresh collection with broad-query training examples, see scratch/insert_training_data.py.

3. Predict Labels

Loads articles from MongoDB, merges full text from the filesystem, runs the classifiers, and writes predictions back to MongoDB as flat fields (ilabel, keck_score, idrp, drp_reason, ikoa, koa_reason).

# Keck classification (transformer)
python -m src.scripts.predict 2024 --collection test_articles --task keck

# DRP classification (LLM, runs on keck-positive papers only)
python -m src.scripts.predict 2024 --collection test_articles --task drp

# KOA classification (LLM)
python -m src.scripts.predict 2024 --collection test_articles --task koa
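
To make the "flat fields" idea concrete, the sketch below builds the kind of $set update document that could be passed to pymongo's update_one. It is not the code in src/scripts/predict.py: the pairing of fields to tasks is inferred from the field names, build_update and eligible_for_drp are hypothetical helpers, and treating a truthy ilabel as "keck-positive" is an assumption.

```python
# Assumed pairing of each task with its (label, detail) fields.
TASK_FIELDS = {
    "keck": ("ilabel", "keck_score"),
    "drp": ("idrp", "drp_reason"),
    "koa": ("ikoa", "koa_reason"),
}


def build_update(task, label, detail):
    """Return a flat $set update for one task's prediction."""
    label_field, detail_field = TASK_FIELDS[task]
    return {"$set": {label_field: label, detail_field: detail}}


def eligible_for_drp(doc):
    """DRP runs on keck-positive papers only (assumed: truthy ilabel)."""
    return bool(doc.get("ilabel"))


# Made-up prediction values for illustration.
update = build_update("keck", 1, 0.97)
print(update)
```

Keeping the fields flat rather than nested means each task only touches its own two keys, so rerunning one task leaves the other predictions on the document untouched.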
