Topic Detection Using Paragraph Vectors to Support Active Learning in Systematic Reviews

Kazuma Hashimoto, Georgios Kontonatsios, Makoto Miwa, Sophia Ananiadou

Published: 01 Aug 2016, Last Modified: 08 Feb 2026, Venue: Journal of Biomedical Informatics, License: CC BY-SA 4.0
Abstract: Systematic reviews require expert reviewers to manually screen thousands of citations in order to identify all articles relevant to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We first represent documents within the vector space and then cluster them into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). The results show that our method achieves high sensitivity for eligible studies and a significantly lower manual annotation cost than the baseline method. This observation is consistent across two clinical and three public health reviews.
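The pipeline described in the abstract (embed documents, cluster them, treat centroids as latent topics, express each document as a topic mixture) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy 2-D vectors stand in for learned paragraph vectors, plain k-means is assumed as the clustering step, and the softmax over negative Euclidean distances is an assumed way to form the mixture, since the abstract does not state how the weights are computed. The names `kmeans`, `topic_mixture`, `doc_vectors`, `topics`, and `features` are hypothetical.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns k centroids (treated here as latent topics)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: attach each vector to its nearest centroid.
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def topic_mixture(vector, centroids):
    """Softmax over negative distances to the centroids (assumed weighting)."""
    scores = [-math.dist(vector, c) for c in centroids]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-D "document vectors" standing in for learned paragraph vectors.
doc_vectors = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]

topics = kmeans(doc_vectors, k=2)
features = [topic_mixture(v, topics) for v in doc_vectors]
```

Each row of `features` is a probability distribution over topics. In a screening setting, rows like these would be the kind of document representation passed to the active learner's classifier.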