Ontology-Grounded Knowledge Graph Construction for Alzheimer’s Disease Literature Using Multi-Model Ensemble Embeddings

Published: 28 Apr 2026, Last Modified: 28 Apr 2026
Venue: MSLD 2026 Poster
License: CC BY 4.0
Keywords: Alzheimer's Disease, Biomedical ontologies, Knowledge Graphs
TL;DR: A six-stage pipeline extracts 19,885 Alzheimer's-relevant concepts from 100,847 PubMed abstracts using a transformer embedding ensemble and ontology alignment, yielding 6,795 validated triples that improve factual question answering over base models.
Abstract: Research on Alzheimer’s disease (AD) produces a growing body of literature, which makes automated knowledge extraction crucial for synthesizing findings across studies. Although biomedical ontologies offer structured vocabularies that can anchor such extraction, aligning unstructured text with formal ontology terms remains challenging because of surface variation, emerging terminology, and domain-specific lexical ambiguity. To mitigate these challenges, we present a six-stage pipeline for automated knowledge graph construction from AD biomedical literature that is grounded in the Alzheimer’s Disease Ontology (ADO), a structured, curated vocabulary of 4,616 terms encoding concepts related to AD pathology, biomarkers, genetics, and clinical phenotypes. We compiled 100,847 PubMed abstracts filtered for AD and dementia relevance. Using scispaCy’s biomedical NLP models [1], we extracted 805K candidate phrases, to which we applied fast fuzzy deduplication via RapidFuzz [2] to consolidate orthographic and hyphenation variants at a 90% similarity threshold. We then employed a weighted ensemble of four transformer-based embedding models [3], namely Gemma Medical, BGE-large-en [4], Qwen2.5-Coder, and BioBERT-NLI [5], to generate a shared 768-dimensional representation of candidate phrases, which we indexed with FAISS [6] for efficient nearest-neighbor retrieval and semantic alignment to the ADO vocabulary. Phrases were then stratified into high-confidence matches (similarity ≥ 0.83) and medium-confidence candidates (similarity 0.75–0.83); the latter were passed to a UMAP–HDBSCAN clustering stage [7, 8] to reveal latent concept groupings that may warrant ontology expansion. Semantic alignment yielded 15,057 high-confidence phrase–term pairs, while 76,233 medium-confidence candidates were deferred to the clustering stage for validation.
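The stratification step described above can be illustrated with a minimal sketch. Only the two thresholds (0.83 and 0.75) come from the abstract; the function names, the "rejected" tier label, the use of plain cosine similarity on a single vector pair, and the treatment of 0.75 as an inclusive lower bound are assumptions made for illustration, not the authors' implementation.

```python
import math

HIGH_THRESHOLD = 0.83    # stated in the abstract
MEDIUM_THRESHOLD = 0.75  # stated in the abstract; inclusive bound is an assumption


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def stratify(similarity):
    """Assign a phrase-to-term similarity score to a confidence tier."""
    if similarity >= HIGH_THRESHOLD:
        return "high"      # treated as a direct phrase-term match
    if similarity >= MEDIUM_THRESHOLD:
        return "medium"    # deferred to the clustering stage
    return "rejected"      # hypothetical label for everything below 0.75
```

For example, a phrase whose best ontology match scores 0.80 would fall into the medium tier and be handed to clustering rather than aligned directly.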
To assess cluster quality, we employed a Cluster Coherence Score (CCS), a composite measure that integrates internal semantic similarity, lexical diversity, part-of-speech composition, and named entity density. The CCS addresses known limitations of standard metrics such as the silhouette score [9] and the Davies–Bouldin index [10] in semantic embedding spaces. Clusters meeting CCS ≥ 0.62 and ontology similarity ≥ 0.70 were auto-accepted, yielding 4,828 additional clusters, for a total of 19,885 actionable concepts. To construct the knowledge graph from these cleaned concepts, we employed GPT-4o-mini for large-scale relationship extraction under a controlled inventory of 41 biomedical relation types spanning causal, treatment, measurement, biological, and clinical relations. This produced 6,795 validated triples. Preliminary evaluation on a synthetic quiz-style benchmark comparing base, fine-tuned, and KG-augmented model variants indicates that KG retrieval substantially improves factual question answering, particularly for definition-style questions. This suggests that ontology-grounded structured context supplies domain knowledge that is not strongly encoded in model parameters alone.
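The auto-acceptance rule and the concept total reported above can be checked with a short sketch. The thresholds and counts are taken from the abstract; the `Cluster` class and `auto_accept` function are hypothetical names, and the sketch assumes the CCS has already been computed, since the abstract does not give its formula.

```python
from dataclasses import dataclass

CCS_THRESHOLD = 0.62           # stated in the abstract
ONTOLOGY_SIM_THRESHOLD = 0.70  # stated in the abstract


@dataclass
class Cluster:
    """Hypothetical container for a cluster's precomputed scores."""
    ccs: float
    ontology_similarity: float


def auto_accept(cluster):
    """A cluster is accepted only if it meets both thresholds."""
    return (cluster.ccs >= CCS_THRESHOLD
            and cluster.ontology_similarity >= ONTOLOGY_SIM_THRESHOLD)


# Counts reported in the abstract.
HIGH_CONFIDENCE_PAIRS = 15_057
ACCEPTED_CLUSTERS = 4_828
total_concepts = HIGH_CONFIDENCE_PAIRS + ACCEPTED_CLUSTERS  # 19,885
```

The sum of the high-confidence pairs and the accepted clusters matches the 19,885 actionable concepts stated in the abstract.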
Submission Number: 46