DK-BEHRT: Teaching language models International Classification of Diseases (ICD) codes using known disease descriptions
Abstract: The widespread digitization of healthcare and patient data has created new opportunities to explore machine learning techniques for improving patient care. The sheer scale of this data has particularly motivated the use of deep learning methods like BERT, which can learn robust representations of medical concepts from patient data without the need for direct supervision. Simultaneously, recent research has shown that language models (LMs) trained on scientific literature can capture strong domain-specific knowledge, including concepts highly relevant to healthcare. In this paper, we leverage two complementary sources of information, patient medical records and descriptive clinical text, to learn complex clinical concepts, such as diagnostic codes, more effectively. Although significant strides have been made in using language models with each data type individually, few studies have explored whether the domain expertise acquired from scientific text can provide a beneficial inductive bias when applied to learning from patient records. To address this gap, we propose the Domain Knowledge BEHRT (DK-BEHRT), a model that integrates disease description embeddings from domain-specific language models, such as BioGPT, into the attention mechanisms of a BERT-based architecture. By incorporating these "knowledge" embeddings, we aim to enhance the model's ability to understand clinical concepts (e.g., ICD codes) and to predict clinical outcomes with higher accuracy. We validate this approach on the MIMIC-IV dataset and find that incorporating specialized embeddings consistently improves predictive accuracy for clinical outcomes compared to using generic embeddings or training the base model from scratch.
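The abstract does not specify how the knowledge embeddings enter the attention computation, but the core idea (injecting precomputed disease-description vectors, e.g. pooled BioGPT encodings of ICD code descriptions, into a BERT-style self-attention layer) can be sketched as below. The additive fusion into the key projections, the module name `KnowledgeAugmentedSelfAttention`, and all dimension and argument names are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class KnowledgeAugmentedSelfAttention(nn.Module):
    """Single-head self-attention whose keys are augmented with
    precomputed disease-description ("knowledge") embeddings.

    Sketch only: the fusion scheme and all names here are assumptions
    made for illustration, not the DK-BEHRT reference implementation.
    """

    def __init__(self, hidden_dim: int, knowledge_dim: int):
        super().__init__()
        self.query = nn.Linear(hidden_dim, hidden_dim)
        self.key = nn.Linear(hidden_dim, hidden_dim)
        self.value = nn.Linear(hidden_dim, hidden_dim)
        # Project external knowledge embeddings (e.g. BioGPT encodings of
        # ICD code descriptions) into the attention key space.
        self.knowledge_proj = nn.Linear(knowledge_dim, hidden_dim)
        self.scale = hidden_dim ** -0.5

    def forward(self, token_states: torch.Tensor,
                knowledge_embeds: torch.Tensor) -> torch.Tensor:
        # token_states:     (batch, seq_len, hidden_dim)  BEHRT-style code embeddings
        # knowledge_embeds: (batch, seq_len, knowledge_dim) one description vector per code
        q = self.query(token_states)
        k = self.key(token_states) + self.knowledge_proj(knowledge_embeds)
        v = self.value(token_states)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v


# Toy usage: 2 patients, 8 diagnosis codes each, 768-d hidden states,
# 1024-d description embeddings (all sizes are placeholder assumptions).
layer = KnowledgeAugmentedSelfAttention(hidden_dim=768, knowledge_dim=1024)
tokens = torch.randn(2, 8, 768)
knowledge = torch.randn(2, 8, 1024)
out = layer(tokens, knowledge)  # shape: (2, 8, 768)
```

One design point worth noting: because the description embeddings are precomputed once per ICD code, this kind of fusion adds only a single linear projection per layer at training time, leaving the rest of the BERT-based architecture unchanged.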