Abstract: The number of scientific publications grows every year. This growth has contributed to the continuous emergence of new trends across scientific domains, and researchers and investors are eager to predict these trends in advance. Using Natural Language Processing (NLP) techniques, this study aimed to develop a topic modeling approach that identifies potentially emerging trending topics in a predetermined scientific field, given a user query and a set of thousands of abstracts. The proposed approach involves preprocessing the abstracts, encoding both the abstracts and the query with the “allenai-specter” Sentence-BERT (SBERT) model, retrieving the abstracts most similar to the user query, and clustering them with the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm. Subsequently, Latent Dirichlet Allocation (LDA) is applied to each cluster for topic modeling, and the resulting topics are labeled and analyzed over time to determine their trending status. Finally, abstracts discussing these emerging topics are summarized with the OpenAI “gpt-3.5-turbo-0125” model. The combination of these techniques achieved the stated objectives. Tested on data from 2016 to 2023 with NLP as the domain of examination, the approach accurately identified five clusters, each representing a specific NLP subfield, and highlighted approximately six trending topics spanning areas such as Performance Optimization in Pretrained Language Models, Multimodal Language Processing, Sentiment Analysis, and Text Generation methods in LLMs.
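The abstract outlines a retrieval, clustering, and topic-modeling pipeline. The following is a minimal sketch of that pipeline, assuming the sentence-transformers, hdbscan, and scikit-learn packages; the function name, `top_k`, `min_cluster_size`, and `n_components` values are illustrative assumptions, not the authors' exact configuration, and the final GPT-based summarization step is omitted.

```python
# Hypothetical sketch of the pipeline described in the abstract.
import numpy as np
import hdbscan
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation


def emerging_topic_pipeline(abstracts, query, top_k=1000):
    # 1. Encode abstracts and the query with the "allenai-specter" SBERT model.
    encoder = SentenceTransformer("allenai-specter")
    abs_emb = encoder.encode(abstracts, normalize_embeddings=True)
    query_emb = encoder.encode([query], normalize_embeddings=True)

    # 2. Retrieve the abstracts most similar to the query (cosine similarity
    #    on normalized embeddings reduces to a dot product).
    sims = abs_emb @ query_emb.T
    top_idx = np.argsort(-sims.ravel())[:top_k]
    retrieved = [abstracts[i] for i in top_idx]

    # 3. Cluster the retrieved abstracts with HDBSCAN (parameters assumed).
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(abs_emb[top_idx])

    # 4. Apply LDA to each cluster to extract candidate topics.
    topics = {}
    for label in set(labels) - {-1}:  # -1 marks HDBSCAN noise points
        docs = [d for d, l in zip(retrieved, labels) if l == label]
        vectorizer = CountVectorizer(stop_words="english").fit(docs)
        lda = LatentDirichletAllocation(n_components=3, random_state=0)
        lda.fit(vectorizer.transform(docs))
        vocab = vectorizer.get_feature_names_out()
        topics[label] = [
            [vocab[i] for i in comp.argsort()[-10:][::-1]]  # top-10 words per topic
            for comp in lda.components_
        ]
    # The abstract's later steps (topic labeling, trend analysis over time,
    # and summarization with "gpt-3.5-turbo-0125") are not shown here.
    return topics
```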