Predicting New Concept--Object Associations in Astronomy by Mining the Literature

Jinchu Li; Yuan-Sen Ting; Alberto Accomazzi; Tirthankar Ghosal; Nesar Soorve Ramachandra

Published: 03 Mar 2026, Last Modified: 26 Apr 2026. Venue: ICLR 2026 Workshop FM4Science (Poster). License: CC BY 4.0

Keywords: AI for science, astrophysics, temporal link prediction, literature mining, knowledge graph, entity linking, matrix factorization, recommender systems

TL;DR: Forecast future astrophysical concept--object associations from the literature using an LLM-extracted, SIMBAD-grounded graph and temporal holdout evaluation.

Abstract: We construct a concept--object knowledge graph from the entire astro-ph corpus up to July 2025, using an automated pipeline to extract named astrophysical objects from OCR-processed papers, resolve them to SIMBAD identifiers, and link them to scientific concepts annotated in the source corpus. We then ask whether the historical structure of this graph can forecast new concept--object associations before they appear in print. Because the underlying concepts are derived from clustering, semantically related concepts inevitably overlap; we address this by applying an inference-time concept-similarity smoothing step uniformly to all methods. Evaluating across four temporal cutoffs on a physically meaningful subset of concepts, we find that an implicit-feedback matrix factorization model (Alternating Least Squares; ALS) with concept smoothing outperforms the strongest neighborhood baseline (KNN based on text-embedding concept similarity) by 16.8% on NDCG@100 (0.144 vs. 0.123) and 19.8% on Recall@100 (0.175 vs. 0.146), and exceeds the best recency heuristic by 96% and 88%, respectively. These results indicate that historical literature encodes predictive structure not captured by either global heuristics or local neighborhood voting, suggesting a path toward tools that could help triage follow-up targets for costly telescope resources.
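
The abstract names an inference-time concept-similarity smoothing step and two ranking metrics (NDCG@100, Recall@100) but does not give their exact form. The following is a minimal sketch under stated assumptions, not the authors' implementation: smoothing is assumed to be a linear blend of each concept's object scores with a similarity-weighted average over other concepts, relevance is assumed binary, and the names `smooth_scores`, `recall_at_k`, `ndcg_at_k`, and the blend weight `alpha` are hypothetical.

```python
import numpy as np


def smooth_scores(scores, concept_sim, alpha=0.5):
    """Blend each concept's object scores with those of similar concepts.

    scores:      (n_concepts, n_objects) array from any scoring method.
    concept_sim: (n_concepts, n_concepts) nonnegative similarity matrix.
    alpha:       assumed blend weight; 0 leaves the scores unchanged.
    """
    sim = np.array(concept_sim, dtype=float)
    np.fill_diagonal(sim, 0.0)  # a concept does not vote for itself
    row_sums = sim.sum(axis=1, keepdims=True)
    # Row-normalize; concepts with no similar neighbors get all-zero weights,
    # so their scores are only rescaled and their ranking is unchanged.
    weights = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)
    return (1.0 - alpha) * np.asarray(scores, dtype=float) + alpha * (weights @ scores)


def recall_at_k(ranked, relevant, k):
    """Fraction of held-out relevant objects found in the top k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for obj in ranked[:k] if obj in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(1.0 / np.log2(i + 2) for i, obj in enumerate(ranked[:k]) if obj in relevant)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

In a temporal holdout of the kind described, `ranked` would be the objects sorted by smoothed score for one concept using only pre-cutoff data, and `relevant` the set of objects first linked to that concept after the cutoff; per-concept values would then be averaged.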

Submission Number: 95
