Emerging Scientific Topic Discovery by Finding Infrequent Synonymous Biterms

Junfeng Wu, Guangyan Huang, Roozbeh Zarei, Jianxin Li, Guang-Li Huang, Hui Zheng, Jing He, Chi-Hung Chi

2022 (modified: 19 Jan 2023), PAKDD (1) 2022, Readers: Everyone

Abstract: With the growing information load brought by the accelerating growth of research papers, automatically discovering a field’s emerging scientific topics has become vital. It enables broad applications, such as optimizing resource allocation for promising research areas, predicting future technology trends, finding knowledge gaps and new concepts, and recommending personalized research directions. However, two challenges, the rareness of emerging-topic publications and the linguistic diversity in descriptions of emerging topics, hinder existing text analytic methods from effectively identifying the evolving terms in emerging topics. We observe that an emerging topic originating from a collaboration between two sub-fields can be represented by a biterm, with one term from each sub-field. In this paper, we propose isBEST, a novel method that finds Infrequent Synonymous Biterms to discover Emerging Scientific Topics, to overcome these challenges. isBEST reduces linguistic diversity by using document-level clustering to find the linguistic variants of each key biterm. Biterms in the same cluster, which express very similar meanings, are unified to the most common synonymous biterm. Then, to address the rareness issue, isBEST converts each input document into a vector of coefficients on synonymous biterms and clusters these vectors at the corpus level using cosine similarity. In each document, larger coefficients are assigned to rarer synonymous biterms. The underlying logic is that a rarer synonymous biterm is more likely to denote an emerging topic, with its two terms each coming from a collaborating sub-field. Experiments on two large scholarly paper datasets demonstrate the accuracy and effectiveness of isBEST.
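The pipeline described in the abstract can be illustrated with a small sketch: unify synonymous biterms, weight rarer biterms more heavily, and compare documents by cosine similarity. This is not the authors' implementation. The abstract does not give the coefficient formula, so the IDF-style weight `log(1 + N / df)` is an assumption, as are the hand-written `canonical` synonym map (standing in for the document-level clustering step), the helper names, and the toy corpus. The corpus-level clustering step itself is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def biterms(tokens):
    """Return the unordered pairs of distinct terms in one document."""
    return {tuple(sorted(pair)) for pair in combinations(set(tokens), 2)}


def rarity_vectors(docs, canonical):
    """Map each document to a sparse vector {synonymous biterm: weight}.

    `canonical` is a hypothetical lookup from a variant biterm to its most
    common synonymous form. The weight log(1 + N / df) is an assumed
    IDF-style coefficient, chosen only so that rarer biterms weigh more.
    """
    unified = [{canonical.get(b, b) for b in biterms(d)} for d in docs]
    df = Counter(b for doc_set in unified for b in doc_set)
    n = len(docs)
    return [{b: math.log(1 + n / df[b]) for b in doc_set} for doc_set in unified]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[b] for b, w in u.items() if b in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus: two rare "quantum + learning" papers and three common ones.
docs = [
    ["quantum", "learning"],
    ["quantum", "training"],  # linguistic variant of the first document
    ["image", "learning"],
    ["image", "learning"],
    ["image", "learning"],
]
# Assumed synonym map; isBEST derives this via document-level clustering.
canonical = {("quantum", "training"): ("learning", "quantum")}

vectors = rarity_vectors(docs, canonical)
rare_weight = vectors[0][("learning", "quantum")]
common_weight = vectors[2][("image", "learning")]
sim_variants = cosine(vectors[0], vectors[1])
sim_unrelated = cosine(vectors[0], vectors[2])
```

In this toy setup the two variant documents collapse onto the same synonymous biterm, so they end up with identical vectors, and the rarer biterm receives the larger coefficient. Any standard clustering algorithm that works with cosine similarity could then group the vectors.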
