Automated Task-Informed Document Retrieval on the COVID-19 Open Research Dataset Using Topic Modeling

Christine R Herlihy; Yuelin Liu

12 Aug 2020 (modified: 24 May 2023). Submitted to NLP-COVID19-EMNLP. Readers: Everyone

Keywords: topic modeling, unsupervised learning, text mining, document retrieval, medical research literature, clinical informatics

TL;DR: We compare keyword- and topic modeling-based approaches to document retrieval on CORD-19 and provide recommendations regarding task creation for future crowd-sourced, unsupervised retrieval efforts.

Abstract: The COVID-19 Open Research Dataset (CORD-19), a fast-growing collection of biomedical research literature, was released in March 2020 to facilitate research addressing the COVID-19 pandemic, together with 17 tasks specifying the research questions of interest. We propose an automated, task-informed document retrieval framework for CORD-19 that uses latent factors learned by topic models to select the research articles most relevant to each task. Compared to naïve keyword-based approaches, our approach broadens the criterion for relevance from the presence of specific terms to the activity of latent topics. We show that it yields an overlapping yet notably different set of selections, because the latent factors account for meaningful document-level co-occurrences that individual keywords fail to capture. Based on both qualitative and quantitative examination of the retrieval results, we further provide recommendations on creating tasks intended for unsupervised document retrieval in large, heterogeneous natural language datasets.
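
The contrast between keyword-based and topic-based relevance can be illustrated with a minimal sketch. This is not the authors' implementation: the document IDs, texts, topic proportions, the task's topic profile, and the function names `keyword_retrieve` and `topic_retrieve` are all made up for illustration, and cosine similarity is only one plausible way to score "topic activity". The sketch assumes that some topic model has already produced a topic distribution for each document and for the task.

```python
import math

# Toy, made-up data for illustration only (not CORD-19 results).
# Each document has a text and an assumed distribution over three latent topics.
DOC_TEXT = {
    "doc_a": "transmission and incubation period of the virus",
    "doc_b": "household contact tracing and secondary attack rate",
    "doc_c": "vaccine candidates and antibody response",
    "doc_d": "hospital resource allocation with transmission risk for staff",
}
DOC_TOPICS = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.70, 0.10, 0.20],
    "doc_c": [0.05, 0.90, 0.05],
    "doc_d": [0.10, 0.10, 0.80],
}


def keyword_retrieve(keywords, doc_text):
    """Return IDs of documents containing at least one task keyword."""
    wanted = {k.lower() for k in keywords}
    return [doc_id for doc_id, text in doc_text.items()
            if wanted & set(text.lower().split())]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_retrieve(task_topics, doc_topics, top_k):
    """Return the top_k document IDs ranked by topic similarity to the task."""
    ranked = sorted(doc_topics,
                    key=lambda doc_id: cosine(task_topics, doc_topics[doc_id]),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical task about transmission, expressed two ways.
task_keywords = ["transmission", "incubation"]
task_topic_profile = [0.90, 0.05, 0.05]  # assumed: task loads mostly on topic 0

by_keyword = keyword_retrieve(task_keywords, DOC_TEXT)
by_topic = topic_retrieve(task_topic_profile, DOC_TOPICS, top_k=2)
print("keyword:", by_keyword)
print("topic:  ", by_topic)
print("overlap:", sorted(set(by_keyword) & set(by_topic)))
```

In this toy setup the two selections share one document but differ on the other: the keyword match picks up a document that merely mentions a task term, while the topic match picks up a document that shares the task's dominant topic without containing any of its keywords.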
