DR.EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data

ACL ARR 2025 February Submission7907 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Electronic Health Records (EHRs) are pivotal in clinical practice, yet their retrieval remains a challenge because it relies on exact-match methods that fail to bridge semantic gaps. Recent advances in dense retrieval offer promising solutions, but existing models, both general-domain and biomedical-domain, fall short due to insufficient medical knowledge or mismatched training corpora. This paper introduces DR.EHR, a series of dense retrieval models specifically tailored for EHR retrieval. We propose a two-stage training pipeline that uses MIMIC-IV discharge summaries to address the need for extensive medical knowledge and large-scale training data. The first stage involves medical entity extraction and knowledge injection from a biomedical knowledge graph, while the second stage employs large language models to generate diverse training data. We train two variants of DR.EHR, with 110M and 7B parameters, respectively. Evaluated on the CliniQ benchmark, our models significantly outperform all existing dense retrievers, achieving state-of-the-art results. Detailed analyses confirm our models' superiority across various match and query types, particularly in challenging semantic matches such as implication and abbreviation. Ablation studies validate the effectiveness of each pipeline component, underscoring the models' enhanced medical knowledge and adaptability to the EHR retrieval task. This work significantly advances EHR retrieval, offering a robust solution for clinical applications.
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: EHR retrieval, knowledge injection, synthetic data, information retrieval
Languages Studied: English
Submission Number: 7907