HemaRAG: A Retrieval-Augmented Generation System for Medical Question Answering in Hematologic Malignancies

Published: 19 Aug 2025, Last Modified: 12 Oct 2025, Venue: BHI 2025, License: CC BY 4.0
Confirmation: I have read and agree with the IEEE BHI 2025 conference submission's policy on behalf of myself and my co-authors.
Keywords: Retrieval-Augmented Generation, RAG, LLMs, Question Answering, PubMed, BioASQ, Clinical NLP, Hematology, Hematologic Malignancies
TL;DR: HemaRAG is a biomedical question-answering system for hematologic malignancies that combines ontology-enriched retrieval with a fine-tuned language model to generate accurate, semantically rich answers from PubMed data.
Abstract: Answering complex medical questions requires both reliable information retrieval and the ability to generate responses that are medically accurate and contextually appropriate. In this paper, we present HemaRAG, a Retrieval-Augmented Generation (RAG) system designed specifically for hematologic malignancies. Our system combines an ontology-enhanced dense retriever with a large language model (Gemma 3) fine-tuned locally on domain-specific literature and question–answer pairs. To build a robust retrieval base, we enriched PubMed abstracts and curated datasets such as BioASQ and PubMedQA using synonym mappings from MeSH, NCIT, DOID, and UMLS. A local vector database supports high-speed semantic search without sharing data externally. Evaluation on both the BioASQ and long-form PubMedQA benchmarks showed high semantic accuracy (BERTScore: 87–89%), strong lexical overlap (F1: 49–52%), and high retrieval performance (Recall@10: 94–96%), despite the challenges posed by free-form medical questions. The system was developed and deployed entirely locally, making it suitable for clinical contexts where patient data privacy is essential. In future work, we plan to integrate HemaRAG into an empathetic conversational agent designed to support patients and clinicians in the field of hematologic oncology.
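As a rough illustration of the ideas named in the abstract, the following Python sketch shows synonym-based text enrichment together with two of the reported metric types. It is not the authors' implementation. The `SYNONYMS` map, the function names, and the whitespace tokenization are all hypothetical, and the abstract does not specify how F1 or Recall@10 were computed, so the SQuAD-style token F1 and the fraction-of-relevant-documents definition of Recall@k used here are assumptions.

```python
from collections import Counter

# Hypothetical synonym map for illustration only; the paper draws its
# synonym mappings from MeSH, NCIT, DOID, and UMLS.
SYNONYMS = {
    "aml": ["acute myeloid leukemia"],
    "cll": ["chronic lymphocytic leukemia"],
}


def enrich(text, synonyms=SYNONYMS):
    """Append known synonyms for every token of the text found in the map."""
    tokens = text.lower().split()
    extra = [syn for tok in tokens for syn in synonyms.get(tok, [])]
    return " ".join([text] + extra)


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def token_f1(prediction, reference):
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In a setup like this, `enrich` would be applied to abstracts or queries before embedding, so that an abbreviation such as "AML" and its expanded form land close together in the vector space. Semantic metrics such as BERTScore require a pretrained model and are left out of the sketch.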
Track: 2. Bioinformatics
Registration Id: V3N4PKQ6DZ3
Submission Number: 307