Abstract: As the number of scientific publications grows rapidly, researchers face the challenge of keeping pace with field-specific advances. This paper presents methodological advancements in topic modeling that leverage state-of-the-art language models. We introduce the AHAM methodology and a score for domain-specific adaptation of the BERTopic framework to enhance scientific text analysis. Using the LLaMa2 generative model, we produce topic definitions via one-shot learning, with domain experts crafting the prompts that guide literature mining by asking the model to label the emerging topics. For inter-topic similarity assessment, we employ metrics from language generation and translation, aiming to minimize the number of outlier topics and the overlap between topic definitions. AHAM has been validated on a newly gathered corpus of scientific papers, where it proves effective at revealing novel insights across research areas. We also examine the impact of domain adaptation of sentence-transformers on topic modeling precision, using datasets from arXiv and focusing on the size of the adaptation data and the niche of adaptation. Our findings indicate a significant interaction between domain adaptation and topic modeling accuracy, especially regarding outliers and topic clarity. We release our code at https://github.com/bkolosk1/aham.
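For orientation, the sketch below illustrates the general shape of such a pipeline: BERTopic fitted over sentence-transformer embeddings, with discovered topics handed to a generative model for one-shot labeling. It assumes only the standard BERTopic and sentence-transformers APIs; the embedding model name, the prompt template, and the `llm_generate` callback are illustrative placeholders, not the exact configuration used in the paper.

```python
# Illustrative sketch: BERTopic topic discovery followed by LLM-based topic labeling.
# The prompt template and llm_generate callback are hypothetical placeholders.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


def build_one_shot_prompt(keywords):
    """Build a one-shot labeling prompt from a topic's top keywords (hypothetical template)."""
    example = (
        "Keywords: neural, network, training, gradient, loss\n"
        "Label: Neural network optimization\n\n"
    )
    return example + "Keywords: " + ", ".join(keywords) + "\nLabel:"


def topic_model_with_llm_labels(docs, llm_generate,
                                embedding_model_name="all-MiniLM-L6-v2"):
    """Fit BERTopic on `docs` and label each topic with a generative model.

    `llm_generate` is any callable that maps a prompt string to a label string
    (e.g. a wrapper around a LLaMa2 chat endpoint); it is left abstract here.
    """
    # 1. Embed documents with a (possibly domain-adapted) sentence transformer.
    embedder = SentenceTransformer(embedding_model_name)  # placeholder model
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(docs)

    # 2. For each discovered topic, ask the LLM for a human-readable definition.
    labels = {}
    for topic_id in set(topics):
        if topic_id == -1:  # -1 is BERTopic's outlier topic
            continue
        keywords = [word for word, _ in topic_model.get_topic(topic_id)]
        labels[topic_id] = llm_generate(build_one_shot_prompt(keywords))
    return topic_model, labels
```

The generated labels can then be compared pairwise, for example with generation- or translation-style similarity scores, to flag overlapping topic definitions and outliers, as described above.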