Cardiology record multi-label classification using latent Dirichlet allocation

Jorge Pérez, Alicia Pérez, Arantza Casillas, Koldo Gojenola

2018 (modified: 31 May 2021). Comput. Methods Programs Biomed. 2018. Readers: Everyone

Abstract:

Highlights
• Background: Electronic Health Records (EHRs) comprise valuable information that should be represented in a compact though comprehensive way for automatic processing.
• Methods: Latent Dirichlet Allocation (LDA) detects latent topics in unannotated data and represents EHRs by means of multinomial distributions.
• Contributions: (1) assessment of the topic models; (2) application of topic models to a multi-label classification task, i.e. assigning sets of ICD codes to EHRs.
• Results: Topic models made it possible to represent documents in a continuous space obtained from unsupervised methods, and they were shown to convey meaningful information that correlates with the ICD.

Abstract

Background and Objectives: Electronic health records (EHRs) convey vast and valuable knowledge about dynamically changing clinical practices. Indeed, clinical documentation entails the inspection of a massive number of records across hospitals and hospital sections. The goal of this study is to provide an efficient framework that helps clinicians explore EHRs and attain alternative views of both patient segments and diseases, such as clustering and statistical information about the development of heart diseases (replacement of pacemakers, valve implantation, etc.) in co-occurrence with other diseases. The task is challenging: it deals with lengthy health records and a high number of classes in a multi-label setting.

Methods: LDA is a statistical procedure optimized to explain each document by a multinomial distribution over latent topics, and each topic by a distribution over related words. These distributions make it possible to represent collections of texts in a continuous space, enabling distance-based associations between documents and also revealing the underlying topics. The topic models were assessed by means of four divergence metrics.
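As an illustration of the distance-based associations mentioned above, the following minimal sketch compares documents through their topic distributions. The abstract does not name the four divergence metrics that were used, so the Jensen-Shannon divergence and the Hellinger distance here are assumptions chosen only as common examples; the document-topic vectors are hypothetical and do not come from the paper's data.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Symmetric, bounded divergence built from KL against the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(squared / 2)

# Hypothetical document-topic distributions over 4 latent topics (illustrative only).
doc_a = [0.70, 0.20, 0.05, 0.05]
doc_b = [0.65, 0.25, 0.05, 0.05]
doc_c = [0.05, 0.05, 0.20, 0.70]

# Documents dominated by the same topic should be closer than unrelated ones.
print(jensen_shannon(doc_a, doc_b) < jensen_shannon(doc_a, doc_c))
print(hellinger(doc_a, doc_b) < hellinger(doc_a, doc_c))
```

Because each document is a short probability vector rather than a sparse bag of words, such pairwise comparisons stay cheap even for lengthy records.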
In addition, we applied LDA to the task of multi-label document classification of EHRs according to the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10). Each EHR was assigned 7 codes on average, out of 970 different codes corresponding to cardiology.

Results: First, the discriminative ability of topic models was assessed using dissimilarity metrics. Nevertheless, there was an open question regarding the interpretability of automatically discovered topics. To address this issue, we explored the connection between the latent topics and ICD-10. EHRs were represented by means of LDA and, next, supervised classifiers were inferred from those representations. Given the low-dimensional representation provided by LDA, the search was computationally efficient compared to symbolic approaches such as TF-IDF. The classifiers achieved an average AUC of 77.79. As a side contribution, with this work we released the software, implemented in Python and R, to both train and evaluate the models.

Conclusions: Topic modeling offers a means of representing EHRs in a low-dimensional continuous space. This representation conveys relevant information as hidden topics in a comprehensive manner. Moreover, in practice, this compact representation made it possible to extract the ICD-10 codes associated with EHRs.
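To make the evaluation figure concrete, here is a minimal sketch of how an average AUC can be computed in a multi-label setting. It assumes macro-averaging (one AUC per code, then the mean); the abstract does not state which averaging scheme was used. The function names `auc` and `macro_auc`, as well as the toy scores and labels, are hypothetical. The sketch returns values on a 0 to 1 scale, whereas the reported 77.79 appears to be expressed as a percentage.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as 0.5. Returns None when a label has no positives or no negatives,
    since the AUC is undefined in that case.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auc(score_matrix, label_matrix):
    """Mean of the per-label AUCs, skipping labels whose AUC is undefined."""
    n_labels = len(label_matrix[0])
    per_label = []
    for j in range(n_labels):
        value = auc([row[j] for row in score_matrix], [row[j] for row in label_matrix])
        if value is not None:
            per_label.append(value)
    return sum(per_label) / len(per_label)

# Toy data: 4 records, 3 hypothetical codes (rows = records, columns = codes).
true_codes = [
    [1, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]
predicted_scores = [
    [0.9, 0.2, 0.6],
    [0.3, 0.8, 0.4],
    [0.7, 0.1, 0.5],
    [0.4, 0.6, 0.3],
]

print(macro_auc(predicted_scores, true_codes))
```

With hundreds of codes and only a handful assigned per record, many codes have few positive examples, which is one reason a per-code ranking measure such as AUC is a natural fit for this kind of task.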
