From Evidence to Knowledge: A Hierarchical Probabilistic Model of the Scientific Knowledge Landscape at Web Scale

Published: 29 Sept 2025, Last Modified: 24 Oct 2025, Venue: NeurIPS 2025 - Reliable ML Workshop, Readers: Everyone, License: CC BY 4.0
Keywords: reliable machine learning, missing data, scientific literature mining, large language models, probabilistic modeling, hierarchical Gaussian mixture models, Jensen-Shannon divergence, literature-based discovery, knowledge graphs
TL;DR: We turn noisy LLM-extracted scientific claims into a reliable hierarchical probabilistic model that predicts unseen relations, flags out-of-consensus papers, and accelerates discovery.
Abstract: Scientific literature contains essential but often fragmented and conflicting evidence, a persistent challenge brought into focus by the emergence of Large Language Models (LLMs) that can read and extract information at web scale. Traditional methods for knowledge integration rely on knowledge graphs that treat extracted statements as deterministic facts. They impose rigid assumptions, such as the closed-world assumption and the independence of relationships, and so fail to capture uncertainty or reconcile contradictions. We introduce a shift from deterministic fact aggregation to a probabilistic framework that models article-level evidence as noisy, partial observations of a latent hierarchical structure. Applied to a biomedical corpus, our method synthesizes article-level evidence into stable and biologically coherent clusters, indicating that reliable signals can be extracted even when inputs are sparse, biased, or unreliable.
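Illustrative sketch (not the authors' implementation): the keywords mention Jensen-Shannon divergence and the TL;DR mentions flagging out-of-consensus papers, so the idea of treating each article as a noisy observation can be illustrated with a toy example. Everything here is an assumption made for illustration: the three outcome categories, the pooled-count consensus, the hypothetical function names, and the threshold of 0.3. The paper's hierarchical model is not reproduced; the consensus below is a plain pooled average that also includes the article being scored.

```python
import math

def kl_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (bounded between 0 and 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

def normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def flag_out_of_consensus(article_counts, threshold=0.3):
    """Return {article_id: divergence} for articles far from the pooled consensus.

    article_counts maps an article id to claim counts over a fixed set of
    outcome categories (hypothetical here: positive, negative, no effect).
    """
    n_categories = len(next(iter(article_counts.values())))
    pooled = [sum(c[i] for c in article_counts.values()) for i in range(n_categories)]
    consensus = normalize(pooled)
    scores = {a: js_divergence(normalize(c), consensus) for a, c in article_counts.items()}
    return {a: s for a, s in scores.items() if s > threshold}

# Hypothetical claim counts for one relation: (positive, negative, no effect).
articles = {
    "A": [8, 1, 1],
    "B": [7, 2, 1],
    "C": [9, 0, 1],
    "D": [0, 9, 1],  # mostly contradicts the other three
}
flagged = flag_out_of_consensus(articles)
print(sorted(flagged))
```

In this toy setting only article D exceeds the threshold. A fuller treatment would presumably replace the pooled average with a posterior from the latent hierarchical structure described in the abstract, so that sparse or biased articles are weighted accordingly.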
Submission Number: 202