Graph-Based Retriever Captures the Long Tail of Biomedical Knowledge

Published: 17 Jun 2024, Last Modified: 16 Jul 2024
Venue: ML4LMS Poster
License: CC BY 4.0
Keywords: RAG LLM, Information retrieval, knowledge graphs, biomedical literature
TL;DR: Discovering new associations between biomedical entities (drugs, genes, diseases) with LLMs and knowledge graphs
Abstract: Large language models (LLMs) are transforming the way information is retrieved, with vast amounts of knowledge being summarized and presented via natural language conversations. Yet LLMs are prone to highlighting the most frequently seen pieces of information from the training set and neglecting the rare ones. In biomedical research, the latest discoveries are key to academic and industrial actors, yet they are obscured by the abundance of an ever-increasing literature corpus (the information overload problem). Surfacing new associations between biomedical entities (e.g., drugs, genes, diseases) with LLMs therefore becomes a challenge of capturing the long-tail knowledge of biomedical scientific production. In this study, we show that typical RAG methods may leave out a significant proportion of relevant information due to clusters of over-represented concepts in the biomedical literature. We introduce a novel method that leverages a knowledge graph to down-sample these clusters and mitigate the information overload problem. Its retrieval performance is roughly twice that of embedding-similarity alternatives in both precision and recall. Finally, we demonstrate that embedding-similarity and knowledge-graph retrieval methods can be combined into a hybrid model that outperforms both, enabling potential improvements to biomedical question-answering models.
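The following is a minimal, self-contained sketch (not the authors' implementation) of the idea described in the abstract: an embedding-similarity retriever is combined with a knowledge-graph retriever that down-samples over-represented entity clusters, and the two score lists are fused into a hybrid ranking. All data, function names, caps, and weights below are hypothetical illustrations.

```python
# Hypothetical sketch of hybrid retrieval with knowledge-graph-based down-sampling.
from collections import defaultdict
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def embedding_retrieve(query_vec, doc_vecs, k):
    """Rank documents by embedding similarity to the query (typical RAG retrieval)."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]


def kg_retrieve(query_entities, doc_entities, k, per_entity_cap=2):
    """Rank documents by overlap with the query's knowledge-graph entities, but cap how
    many documents any single (possibly over-represented) entity may contribute,
    i.e. down-sample dense entity clusters."""
    taken_per_entity = defaultdict(int)
    scored = sorted(
        ((doc_id, len(ents & query_entities)) for doc_id, ents in doc_entities.items()),
        key=lambda x: x[1], reverse=True,
    )
    results = []
    for doc_id, score in scored:
        if score == 0:
            continue
        matched = doc_entities[doc_id] & query_entities
        # Admit the document only if at least one matched entity is still under its cap.
        if all(taken_per_entity[e] >= per_entity_cap for e in matched):
            continue
        for e in matched:
            taken_per_entity[e] += 1
        results.append((doc_id, float(score)))
        if len(results) == k:
            break
    return results


def hybrid_retrieve(query_vec, query_entities, doc_vecs, doc_entities, k, alpha=0.5):
    """Blend normalized scores from both retrievers (one possible fusion scheme)."""
    emb = dict(embedding_retrieve(query_vec, doc_vecs, k * 2))
    kg = dict(kg_retrieve(query_entities, doc_entities, k * 2))
    max_kg = max(kg.values(), default=1.0) or 1.0
    fused = {
        d: alpha * emb.get(d, 0.0) + (1 - alpha) * kg.get(d, 0.0) / max_kg
        for d in set(emb) | set(kg)
    }
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy corpus: two near-duplicate documents about a common gene and one rare association.
    doc_vecs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
    doc_entities = {"d1": {"TP53"}, "d2": {"TP53"}, "d3": {"TP53", "rare_gene_X"}}
    hits = hybrid_retrieve([1.0, 0.2], {"TP53", "rare_gene_X"}, doc_vecs, doc_entities, k=2)
    print(hits)  # the long-tail document d3 surfaces despite its low embedding similarity
```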
Poster: pdf
Submission Number: 129