This repository accompanies the arXiv preprint:
Predicting New Concept–Object Associations in Astronomy by Mining the Literature
Jinchu Li, Yuan‑Sen Ting, Alberto Accomazzi, Tirthankar Ghosal, Nesar Ramachandra
arXiv: 2602.14335 (astro-ph.IM)
It provides the end-to-end experimental pipeline to construct a large-scale, literature-derived concept–object graph and to forecast future concept–object associations under a strict temporal evaluation protocol.
Given a cutoff year T, train on all concept–object associations observed up to T,
and evaluate how well different methods rank objects whose association with the concept first appears after T.
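The split protocol above can be sketched in a few lines. This is a minimal illustration, not the repository's `prepare_cutoff.py`: the `(concept, object, year)` tuple layout, the function name `temporal_split`, and the placeholder names are assumptions, and "up to T" is read here as inclusive of year T.

```python
def temporal_split(edges, cutoff):
    """Split (concept, object, year) records at cutoff year T.

    Train:  pairs first observed in a year <= cutoff (assumed inclusive).
    Target: pairs whose first observation falls in a year > cutoff.
    """
    first_seen = {}
    for concept, obj, year in edges:
        pair = (concept, obj)
        if pair not in first_seen or year < first_seen[pair]:
            first_seen[pair] = year
    train = {pair for pair, year in first_seen.items() if year <= cutoff}
    target = {pair for pair, year in first_seen.items() if year > cutoff}
    return train, target


# Toy records with placeholder names (not real data).
edges = [
    ("concept_a", "object_x", 2015),
    ("concept_a", "object_x", 2020),  # repeat of a pair already seen before T
    ("concept_a", "object_y", 2019),
    ("concept_b", "object_x", 2018),
]
train, target = temporal_split(edges, cutoff=2017)
```

Because each pair is keyed by its first appearance, a pair seen before the cutoff never reappears as a target, even if it is mentioned again later.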
The default configuration reproduces:
- Results with inference-time concept smoothing (main paper results)
- Results without smoothing (ablation)
- 1. Repository overview
- 2. Required external data
- 3. Installation
- 4. Configuration
- 5. Sample workflow
- 6. Reproducibility notes
- 7. Citation
This repository implements the complete forecasting workflow:
- Strict temporal split construction (train/target cutoff protocol)
- Concept–object graph assembly from literature-derived inputs
- Concept-neighbor construction (for smoothing and embedding-based baselines)
- Training + evaluation of forecasting methods
- Stratified metric aggregation over concept subsets
Paper–concept associations and concept embeddings are sourced from:
Ting et al. (2025), AstroMLab 5: Structured Summaries and Concept Extraction for ~400,000 Astrophysics Papers
Place the following files in data/:
- concepts_embeddings.npz
- concepts_vocabulary.csv
- papers_concepts_mapping.csv
- papers_year_mapping.csv
This repository does not redistribute AstroMLab 5 data.
This repository also expects:
- Mention-level LLM object extraction data: paper_object_edges_llm_mentions.jsonl
- SIMBAD name resolution cache: simbad_name_resolution_cache_*.jsonl
Each JSONL row corresponds to a single object mention in a paper and includes (at minimum):
- normalized object name
- semantic role
- study mode
- resolved SIMBAD identifier
All concept–object edges and weights are generated dynamically from these mention-level inputs.
No precomputed weighted graph is required.
Tested with Python 3.10+.
pip install -r requirements.txt
All experiments are controlled via:
config/table1.yaml
Edge weights are computed as:
w(c,o) = log(1 + Σ_m ρ_r(m) × γ_σ(m))
where:
- ρ_r(m) is the role weight of mention m
- γ_σ(m) is the study-mode multiplier of mention m
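As a worked illustration of the formula, the sketch below sums role weight times study-mode multiplier over mentions and applies log(1 + x). The numeric weights, the labels `primary_target`, `new_observation`, and `simulation`, and the dictionary field names are hypothetical. It also assumes each mention has already been joined to a concept through its paper.

```python
import math
from collections import defaultdict

# Hypothetical weights for illustration only; the real values live in the config.
ROLE_WEIGHT = {"primary_target": 1.0, "comparison_or_reference": 0.3}
STUDY_MODE_MULT = {"new_observation": 1.0, "simulation": 0.5}


def edge_weights(mentions):
    """Return w(c, o) = log(1 + sum over mentions of rho * gamma)."""
    totals = defaultdict(float)
    for m in mentions:
        rho = ROLE_WEIGHT[m["role"]]               # role weight
        gamma = STUDY_MODE_MULT[m["study_mode"]]   # study-mode multiplier
        totals[(m["concept"], m["object"])] += rho * gamma
    return {pair: math.log1p(total) for pair, total in totals.items()}


mentions = [
    {"concept": "concept_a", "object": "object_x",
     "role": "primary_target", "study_mode": "new_observation"},
    {"concept": "concept_a", "object": "object_x",
     "role": "comparison_or_reference", "study_mode": "simulation"},
]
weights = edge_weights(mentions)  # single edge with weight log(1 + 1.0 + 0.15)
```

The logarithm compresses heavily mentioned pairs, so one well-studied object does not dominate a concept's edge weights.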
Weights are configurable under:
weights:
  role_weight:
  study_mode_mult:
Changing these values changes the underlying graph and therefore the scientific question being evaluated.
Edge construction is controlled by:
edge_configs:
  train:
  target:
These control:
- role filtering (role_filter)
- study filtering (study_filter)
- weighting scheme (weighting)
- per-paper normalization (paper_norm)
- region exclusion (noreg)
- mention-level reconstruction (force_mentions_jsonl)
To reproduce the published results:
role_filter: all
study_filter: all
weighting: role_x_mode
paper_norm: none
noreg: true
The evaluation assumes:
- Train and target graphs are built under identical edge semantics
- Only the temporal cutoff defines the split
- Stratification is applied after graph construction
The pipeline allows different configs for train and target, but this is not recommended for standard forecasting experiments.
Using different filters may:
- change which edges count as “seen”
- alter eligibility criteria
- introduce distribution shift
- create evaluation artifacts
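One way to catch accidental divergence is to diff the two blocks before launching a run. The helper below is a hypothetical sketch that works on plain dictionaries (for example, the parsed contents of edge_configs.train and edge_configs.target); it is not part of the repository.

```python
_MISSING = object()  # sentinel so a key absent on one side counts as a difference


def edge_config_diff(train_cfg, target_cfg):
    """Return the sorted list of keys on which the two edge configs disagree."""
    keys = train_cfg.keys() | target_cfg.keys()
    return sorted(
        k for k in keys
        if train_cfg.get(k, _MISSING) != target_cfg.get(k, _MISSING)
    )


# Values taken from the published-results configuration in this README.
paper_cfg = {
    "role_filter": "all",
    "study_filter": "all",
    "weighting": "role_x_mode",
    "paper_norm": "none",
    "noreg": True,
}
same = edge_config_diff(paper_cfg, dict(paper_cfg))
shifted = edge_config_diff(paper_cfg, {**paper_cfg, "role_filter": "substantive"})
```

An empty list means train and target share identical edge semantics; any other result names the settings that would introduce a shift.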
For clarity and reproducibility, keep:
edge_configs.train == edge_configs.target
Edge construction can optionally filter object mentions before aggregation.
role_filter controls which semantic roles are retained:
- all: keep all object mentions (used in the paper)
- substantive: exclude context roles
- primary_only: retain only primary scientific targets
Context roles are defined under:
weights:
  context_roles:
Default context roles:
context_roles:
  - comparison_or_reference
  - calibration
  - serendipitous_or_field_source
In the main experiments, role_filter: all is used, so context roles are included but typically downweighted via smaller role weights.
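The three role_filter options can be read as a simple predicate over a mention's role. The sketch below uses the default context roles listed above; the label `primary_target` for a primary scientific target is a hypothetical stand-in, since the actual role vocabulary is defined by the extraction data.

```python
CONTEXT_ROLES = {
    "comparison_or_reference",
    "calibration",
    "serendipitous_or_field_source",
}
PRIMARY_ROLE = "primary_target"  # hypothetical label, not confirmed by the repository


def keep_mention(role, role_filter):
    """Decide whether a mention with the given role survives role filtering."""
    if role_filter == "all":
        return True
    if role_filter == "substantive":
        return role not in CONTEXT_ROLES
    if role_filter == "primary_only":
        return role == PRIMARY_ROLE
    raise ValueError(f"unknown role_filter: {role_filter!r}")
```

Under role_filter: all, a calibration mention is kept and contributes through its (smaller) role weight, whereas substantive drops it before aggregation.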
study_filter controls filtering by study type:
- all: retain all study modes (used in the paper)
- non_sim_only: exclude theory/simulation-only mentions
- new_obs_only: retain only new observational studies
Main results use:
study_filter: all
When noreg is enabled, objects classified as sky regions or fields (based on SIMBAD object type metadata) are excluded.
This prevents non-physical spatial regions (e.g., survey fields) from behaving like astrophysical objects.
Main results use:
noreg: true
Stratification (via output.strata_to_report) determines which concepts contribute to reported evaluation metrics.
Example:
output:
  strata_to_report:
    - physical_subset_excl_stats_sim_instr
Stratification is applied after graph construction, meaning:
- the graph is built over the full concept universe
- temporal splits are computed on the full graph
- stratification only filters which concepts contribute to reported metrics
- no held-out information is used during graph construction
Training on all concepts and reporting on a subset (e.g., physical concepts) is valid and used in the paper.
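To make the "report on a subset" idea concrete, the sketch below assumes per-concept metric values have already been computed on the full graph and simply averages them over each stratum. The unweighted mean, the function name, and the toy numbers are assumptions for illustration; the paper's exact aggregation may differ.

```python
from statistics import mean


def report_by_stratum(per_concept_metric, strata):
    """Average an already-computed per-concept metric within each stratum."""
    report = {}
    for name, members in strata.items():
        scores = [per_concept_metric[c] for c in members if c in per_concept_metric]
        if scores:  # skip strata with no evaluated concepts
            report[name] = mean(scores)
    return report


# Toy values with placeholder concept names (not results from the paper).
per_concept_metric = {"concept_a": 0.25, "concept_b": 0.5, "concept_c": 1.0}
strata = {
    "all": {"concept_a", "concept_b", "concept_c"},
    "physical_subset_excl_stats_sim_instr": {"concept_a", "concept_b"},
}
report = report_by_stratum(per_concept_metric, strata)
```

The per-concept values are identical in both strata; only the set of concepts entering the average changes.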
| Stratum name | Definition |
|---|---|
| all | All concepts in the training universe |
| physical_subset_excl_stats_sim_instr | Concepts whose high-level class is not in {Statistics & AI, Numerical Simulation, Instrumental Design} |
| nonphysical_only_stats_sim_instr | Concepts whose class is in {Statistics & AI, Numerical Simulation, Instrumental Design} |
| survey_or_measurement_keyword | Concepts whose name/description matches a survey/instrument/measurement keyword regex |
The survey_or_measurement_keyword stratum is defined using a heuristic regex applied to concept names and descriptions (e.g., Gaia, SDSS, photometry, calibration).
Important considerations:
- heuristic, crude text matching
- overlaps substantially with nonphysical_only_stats_sim_instr
- not a headline result in the paper
- included primarily for diagnostics/exploration
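The stratum definitions can be sketched as a membership function. The three class names come from the stratum definitions in this README; the regex is a hypothetical stand-in built only from the four example keywords mentioned here and is not the pattern used by the repository. The class label "Stellar Physics" in the usage lines is likewise a placeholder.

```python
import re

NONPHYSICAL_CLASSES = {"Statistics & AI", "Numerical Simulation", "Instrumental Design"}

# Hypothetical pattern covering only the examples named in the README.
SURVEY_REGEX = re.compile(r"\b(gaia|sdss|photometry|calibration)\b", re.IGNORECASE)


def strata_for(name, description, high_level_class):
    """Return the set of strata a concept belongs to."""
    strata = {"all"}
    if high_level_class in NONPHYSICAL_CLASSES:
        strata.add("nonphysical_only_stats_sim_instr")
    else:
        strata.add("physical_subset_excl_stats_sim_instr")
    if SURVEY_REGEX.search(f"{name} {description}"):
        strata.add("survey_or_measurement_keyword")
    return strata


instrument_like = strata_for("Gaia astrometric calibration", "", "Instrumental Design")
physical_like = strata_for("Stellar winds", "Mass loss from hot stars", "Stellar Physics")
```

The keyword stratum is independent of the class-based ones, which is why it can overlap with nonphysical_only_stats_sim_instr.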
To reproduce the paper:
output:
  strata_to_report:
    - physical_subset_excl_stats_sim_instr
Altering strata changes only what is reported, not how the graph is constructed.
cutoffs: [2017, 2019, 2021, 2023]
Temporal evaluation years.
Minimum number of prior associations required for a concept to be evaluated.
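A minimal sketch of this eligibility rule, assuming "prior associations" means distinct objects linked to the concept in the training graph and that the threshold is inclusive. The function and argument names are hypothetical.

```python
from collections import defaultdict


def eligible_concepts(train_pairs, min_prior):
    """Return concepts with at least min_prior distinct training objects."""
    objects_by_concept = defaultdict(set)
    for concept, obj in train_pairs:
        objects_by_concept[concept].add(obj)
    return {c for c, objs in objects_by_concept.items() if len(objs) >= min_prior}


# Toy training pairs with placeholder names.
train_pairs = [
    ("concept_a", "object_x"),
    ("concept_a", "object_y"),
    ("concept_b", "object_x"),
]
eligible = eligible_concepts(train_pairs, min_prior=2)
```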
Inference-time concept smoothing parameters.
Implicit ALS hyperparameters:
- latent factors
- regularization
- iterations
- alpha
- seeds
To reproduce paper averages, use multiple seeds.
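Averaging over seeds can be sketched as below. The metric values are toy numbers, not results from the paper, and reporting the sample standard deviation alongside the mean is an assumption rather than the repository's documented behavior.

```python
from statistics import mean, stdev


def summarize_over_seeds(metric_by_seed):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    values = list(metric_by_seed.values())
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Toy per-seed values for a single metric.
avg, spread = summarize_over_seeds({0: 0.25, 1: 0.5, 2: 0.75})
```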
From the repository root:
bash scripts/reproduce_table1.sh config/table1.yaml
This runs:
- prepare_cutoff.py
- smoothing.py
- train_eval.py
Outputs:
OUT_DIR/
table1/
_global/
T=2017/
T=2019/
T=2021/
T=2023/
eval_stratified_results.csv
table1.tex
- Graph construction is deterministic given the configuration and input JSONL files.
- No pre-aggregated graph artifacts are required.
Important: altering edge construction changes the scientific object of study and should be clearly documented in derived experiments.
If you use this repository, please cite:
@misc{li2026predictingnewconceptobjectassociations,
title={Predicting New Concept-Object Associations in Astronomy by Mining the Literature},
author={Jinchu Li and Yuan-Sen Ting and Alberto Accomazzi and Tirthankar Ghosal and Nesar Ramachandra},
year={2026},
eprint={2602.14335},
archivePrefix={arXiv},
primaryClass={astro-ph.IM},
url={https://arxiv.org/abs/2602.14335},
}