Automatic publication classifier for W. M. Keck Observatory.
Install in development mode:
pip install -e .
Create .env at the project root:
MONGO_SERVER=hostname
MONGO_PORT=27017
MONGO_USER=username
MONGO_PWD=password
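As a rough illustration of how these four variables could be combined into a connection string, here is a minimal sketch. It assumes the values are already present in the process environment (how the project actually loads .env is not shown here), and `build_mongo_uri` is a hypothetical helper name, not something from this repository.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(env=None):
    """Assemble a MongoDB URI from the MONGO_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' or ':' do not break the URI.
    user = quote_plus(env["MONGO_USER"])
    password = quote_plus(env["MONGO_PWD"])
    host = env["MONGO_SERVER"]
    # Assumption: fall back to MongoDB's default port when MONGO_PORT is unset.
    port = int(env.get("MONGO_PORT", "27017"))
    return f"mongodb://{user}:{password}@{host}:{port}/"
```

The resulting string is in the standard `mongodb://` URI form, so it could be handed to a MongoDB client as-is.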
Copy the tracked config templates to their local (gitignored) counterparts, then edit as needed:
cp config/models.default.yaml config/models.yaml
cp config/article_subset.default.yaml config/article_subset.yaml
Articles live in MongoDB. Full text stays on the filesystem (data/pubs/full_text/). Predictions are written directly back to MongoDB.
Reads bibcodes and links_data from MongoDB, downloads the PDFs, and extracts their text to data/pubs/full_text/{year}/{bibcode}.txt.
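The output layout above can be sketched as a small path helper. This is only an illustration of the {year}/{bibcode}.txt convention; `full_text_path` and `FULL_TEXT_ROOT` are hypothetical names, and the bibcode in the usage note is a made-up placeholder.

```python
from pathlib import PurePosixPath

# Root directory named in this README; the constant name itself is an assumption.
FULL_TEXT_ROOT = PurePosixPath("data/pubs/full_text")


def full_text_path(bibcode, year):
    """Return the text file location for one article (hypothetical helper)."""
    # Layout: data/pubs/full_text/{year}/{bibcode}.txt
    return FULL_TEXT_ROOT / str(year) / f"{bibcode}.txt"
```

For example, `full_text_path("2024ApJ...960...12K", 2024)` points at `data/pubs/full_text/2024/2024ApJ...960...12K.txt`.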
python src/data/fetch_full_text.py --collection test_articles --year 2024
python src/data/fetch_full_text.py --collection test_articles --start-year 2020 --end-year 2025
Loads articles from MongoDB, derives keck_manual from the affiliation field ("keck" → positive, everything else → negative), merges full text from the filesystem, and runs the standard train/test split.
python src/scripts/train.py transformer --year 2000-2023 --save
python src/scripts/train.py embedding --collection test_articles
python src/scripts/train.py transformer --no-test --save # train on all labeled data
Docs without an affiliation set are skipped and reported in the run summary.
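The labeling rule described above might look roughly like the sketch below. Several details are assumptions rather than facts about train.py: that affiliation is a plain string, that the match is exact after trimming and lowercasing, and that labels are encoded as 1/0. The function names are hypothetical.

```python
def derive_keck_manual(doc):
    """Map the affiliation field to a label; None means 'no affiliation set'."""
    affiliation = (doc.get("affiliation") or "").strip()
    if not affiliation:
        return None  # caller skips these documents
    # Assumption: case-insensitive exact match on "keck"; anything else is negative.
    return 1 if affiliation.lower() == "keck" else 0


def label_docs(docs):
    """Split documents into labeled ones and skipped bibcodes for a run summary."""
    labeled, skipped = [], []
    for doc in docs:
        label = derive_keck_manual(doc)
        if label is None:
            skipped.append(doc.get("bibcode"))
        else:
            labeled.append({**doc, "keck_manual": label})
    return labeled, skipped
```

Returning the skipped bibcodes separately keeps the "skipped and reported" behavior explicit instead of silently dropping unlabeled documents.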
Available models: transformer, embedding, snippet (rule-based), llm. Hyperparameters are set in config/models.yaml.
Train a base model once on the full reviewed history, then fine-tune on newly reviewed years as they arrive. The reviewed-subset filter lives in config/article_subset.yaml (copied from config/article_subset.default.yaml; currently: 2020–2024 excluding from_broad_query=true; 2025+ included wholesale).
# 1. Base model — full reviewed history
python src/scripts/train.py transformer --year 2000-2025 --collection articles --save
# 2. Fine-tune once 2025 data is reviewed
python src/scripts/train.py transformer --year 2020-2025 --collection articles \
--save --finetune [BASE MODEL] --subset-articles
When 2026 data is reviewed, extend the year range (e.g. --year 2020-2026) and update config/article_subset.yaml to match, then fine-tune again.
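To make the current reviewed-subset rule concrete, here is a sketch of it as a Python predicate. The real filter is defined in config/article_subset.yaml, whose schema is not shown here; the `year` field name, the boolean type of from_broad_query, and the choice to exclude years before 2020 are all assumptions, and `in_reviewed_subset` is a hypothetical name.

```python
def in_reviewed_subset(doc):
    """Return True if an article falls inside the reviewed subset (sketch)."""
    year = int(doc["year"])  # assumption: articles carry a 'year' field
    if year >= 2025:
        return True  # 2025 onward is included wholesale
    if 2020 <= year <= 2024:
        # Drop articles that came from the broad query.
        return not doc.get("from_broad_query", False)
    return False  # assumption: years before 2020 are outside the subset
```

Extending the rule for a newly reviewed year would then mean adjusting the boundaries here in the same way the YAML file is updated.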
To seed a fresh collection with broad-query training examples, see scratch/insert_training_data.py.
Loads articles from MongoDB, merges full text from the filesystem, runs the classifiers, and writes predictions back to MongoDB as flat fields (ilabel, keck_score, idrp, drp_reason, ikoa, koa_reason).
# Keck classification (transformer)
python -m src.scripts.predict 2024 --collection test_articles --task keck
# DRP classification (LLM, runs on keck-positive papers only)
python -m src.scripts.predict 2024 --collection test_articles --task drp
# KOA classification (LLM)
python -m src.scripts.predict 2024 --collection test_articles --task koa
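The write-back of flat prediction fields could be sketched as follows. This is not the code in src/scripts/predict.py: matching documents by bibcode is an assumption, the helper name is hypothetical, and the example values in the usage note are placeholders rather than real output.

```python
# Flat field names listed in this README.
PREDICTION_FIELDS = ("ilabel", "keck_score", "idrp", "drp_reason", "ikoa", "koa_reason")


def build_prediction_update(bibcode, prediction):
    """Build a (filter, update) pair for one article's predictions (sketch)."""
    # Keep only the known flat fields so stray keys never reach the database.
    fields = {k: v for k, v in prediction.items() if k in PREDICTION_FIELDS}
    if not fields:
        raise ValueError("no prediction fields to write")
    # Assumption: articles are identified by their bibcode.
    return {"bibcode": bibcode}, {"$set": fields}
```

The returned pair has the shape expected by PyMongo's `Collection.update_one(filter, update)`; for instance, a keck task result such as `{"ilabel": 1, "keck_score": 0.97}` would touch only those two fields and leave the DRP and KOA fields unchanged.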