Document Embeddings Enhance Biomedical Retrieval-Augmented Generation

Yongle Kong, Zhihao Yang, Ling Luo, Zeyuan Ding, Lei Wang, Wei Liu, Yin Zhang, Bo Xu, Jian Wang, Yuanyuan Sun, Zhehuan Zhao, Hongfei Lin

Published: 01 Jan 2024, Last Modified: 28 Jul 2025. Venue: BIBM 2024. License: CC BY-SA 4.0

Abstract: Large language models (LLMs) perform well on many NLP tasks but frequently generate inaccurate information in the biomedical domain because of hallucination. Retrieval-Augmented Generation (RAG) addresses this issue by integrating external knowledge, which improves the factual accuracy of outputs. However, naive RAG struggles to use retrieved content effectively, particularly in specialized domains such as biomedicine. Irrelevant retrieved information can interfere with the model’s judgment, and even when relevant documents are retrieved, the model may fail to comprehend and use their domain-specific features because of its inherent knowledge limitations. To overcome these limitations, we propose Document Embeddings Enhanced Biomedical RAG (DEEB-RAG), a framework that supplies document embeddings to the LLM alongside the original retrieved text. DEEB-RAG uses MedCPT to generate the document embeddings, which are then aligned with the LLM’s semantic space by training a simple projector in two stages. Experimental results on biomedical QA datasets show that DEEB-RAG improves accuracy by an average of 2.3% over naive RAG. This demonstrates DEEB-RAG’s ability to mitigate the challenges of using complex biomedical information, thereby enhancing the reliability and effectiveness of LLMs in the biomedical domain.
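The idea of passing projected document embeddings to the LLM together with the retrieved text can be illustrated with a minimal sketch. The abstract does not specify the projector architecture, where the projected vectors enter the input sequence, or the two training stages, so everything below is an assumption made for illustration: a single untrained linear layer, one "soft token" per retrieved document placed before the prompt tokens, and toy dimensions. The names `make_projector`, `project`, and `build_llm_inputs` are hypothetical and do not come from the paper.

```python
import random

def make_projector(in_dim, out_dim, seed=0):
    """Create (weights, bias) for one linear layer.

    Assumption: a single linear map stands in for the 'simple projector';
    the real architecture is not given in the abstract.
    """
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project(doc_embedding, projector):
    """Map a retriever-space vector into the LLM embedding space."""
    weights, bias = projector
    return [sum(w * x for w, x in zip(row, doc_embedding)) + b
            for row, b in zip(weights, bias)]

def build_llm_inputs(doc_embeddings, prompt_token_embeddings, projector):
    """Prepend one projected soft token per document to the prompt tokens.

    Assumption: soft tokens go before the text; the paper may place them elsewhere.
    """
    soft_tokens = [project(e, projector) for e in doc_embeddings]
    return soft_tokens + prompt_token_embeddings

# Toy sizes for illustration only; real encoders and LLMs use far larger dimensions.
RETRIEVER_DIM, LLM_DIM = 4, 6
projector = make_projector(RETRIEVER_DIM, LLM_DIM)

# Stand-ins for document embeddings from a retriever such as MedCPT (2 documents).
doc_embeddings = [[0.1, -0.2, 0.3, 0.4], [0.0, 0.5, -0.1, 0.2]]
# Stand-ins for the LLM's own embeddings of 3 prompt tokens (question + retrieved text).
prompt_tokens = [[0.0] * LLM_DIM, [1.0] * LLM_DIM, [0.5] * LLM_DIM]

inputs = build_llm_inputs(doc_embeddings, prompt_tokens, projector)
print(len(inputs), len(inputs[0]))
```

In a real system the projector weights would be learned rather than randomly initialized, and the resulting sequence would be fed to the LLM as input embeddings in place of token IDs.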