ETBTRank: Ranking Biterms in Paper Titles for Emerging Topic Discovery

Junfeng Wu, Guangyan Huang, Roozbeh Zarei

Published: 01 Jan 2022, Last Modified: 19 Jun 2024, Venue: AI 2022, License: CC BY-SA 4.0

Abstract: Emerging topics, which often originate from the collaboration of two scientific subfields, can be represented by biterms (pairs of terms) in which each term represents a distinct subfield. However, it is challenging to automatically find the two critical terms that exactly represent an emerging topic. First, existing term weighting models (such as TF-IDF, TextRank, RAKE, KECNW, and YAKE) may be effective for finding critical single terms but not critical biterms. Second, a biterm that could suitably represent an emerging topic occurs very rarely in a text (e.g., a corpus composed of paper titles). So, even if we combine two terms to generate a bag of biterms, the above term weighting models remain ineffective because they filter out these rare potential biterms. This paper proposes a novel Emerging Topic BiTerm Rank (ETBTRank) model to help automatically extract biterms that represent emerging topics, distinguishing emerging-topic biterms from unimportant biterms. In ETBTRank, we weigh the two terms in a biterm separately and find emerging-topic biterms by a rule: if a biterm itself is rare, but each of its two terms has a high weight, then it is an emerging-topic biterm. Experimental studies on paper title datasets demonstrate the effectiveness of the proposed model.
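The rule stated in the abstract (a rare biterm whose two terms each carry a high weight) can be illustrated with a minimal Python sketch. This is not the ETBTRank weighting itself, which the abstract does not specify. As an assumption, each term's weight is simply the number of titles containing it, and the function name `emerging_biterms` and the thresholds `min_term_weight` and `max_biterm_count` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def emerging_biterms(titles, min_term_weight=2, max_biterm_count=1):
    """Return biterms that are rare while both of their terms are frequent.

    Assumption: a term's weight is the number of titles it appears in.
    ETBTRank's actual term weighting is not described in the abstract.
    """
    term_weight = Counter()
    biterm_count = Counter()
    for title in titles:
        # Unique, sorted terms so each unordered pair has one canonical form.
        terms = sorted(set(title.lower().split()))
        term_weight.update(terms)
        biterm_count.update(combinations(terms, 2))
    return sorted(
        pair
        for pair, count in biterm_count.items()
        if count <= max_biterm_count
        and all(term_weight[t] >= min_term_weight for t in pair)
    )


# Toy corpus of titles (made up for illustration, not from the paper).
titles = [
    "graph learning",
    "graph mining",
    "quantum computing",
    "quantum sensing",
    "graph quantum",
]
print(emerging_biterms(titles))  # [('graph', 'quantum')]
```

In this toy corpus every biterm occurs once, but only the pair of "graph" and "quantum" joins two terms that are each frequent on their own, so it alone passes the rule. A realistic pipeline would also need stop-word removal and a more informative term weight than raw title counts.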