Abstract: As the number of scientific publications grows rapidly, researchers face the challenge of keeping pace with field-specific advances. This paper presents methodological advancements in topic modeling that leverage state-of-the-art language models. We introduce the AHAM methodology and a score for domain-specific adaptation of the BERTopic framework to enhance scientific text analysis. Using the LLaMa2 generative model, we produce topic definitions via one-shot learning, with domain experts crafting the prompts that guide literature mining by asking the model to label the emerging topics. For inter-topic similarity assessment, we employ metrics from language generation and translation, aiming to minimize the number of outlier topics and the overlap between topic definitions. AHAM has been validated on a newly gathered corpus of scientific papers, where it proves effective at revealing novel insights across research areas. We also examine the impact of domain adaptation of sentence-transformers on topic modeling precision, using datasets from arXiv and focusing on the size of the adaptation data and the niche of adaptation. Our findings indicate a significant interaction between domain adaptation and topic modeling accuracy, especially regarding outliers and topic clarity. We release our code at https://github.com/bkolosk1/aham.
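For orientation, the sketch below illustrates the general shape of such a pipeline: BERTopic fitted over sentence-transformer embeddings, with discovered topics handed to a generative model for one-shot labeling. It assumes only the standard BERTopic and sentence-transformers APIs; the embedding model name, the prompt template, and the `llm_generate` callback are illustrative placeholders, not the exact configuration used in the paper.

```python
# Illustrative sketch: BERTopic topic discovery followed by LLM-based topic labeling.
# The prompt template and llm_generate callback are hypothetical placeholders.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer


def build_one_shot_prompt(keywords):
    """Build a one-shot labeling prompt from a topic's top keywords (hypothetical template)."""
    example = (
        "Keywords: neural, network, training, gradient, loss\n"
        "Label: Neural network optimization\n\n"
    )
    return example + "Keywords: " + ", ".join(keywords) + "\nLabel:"


def topic_model_with_llm_labels(docs, llm_generate,
                                embedding_model_name="all-MiniLM-L6-v2"):
    """Fit BERTopic on `docs` and label each topic with a generative model.

    `llm_generate` is any callable that maps a prompt string to a label string
    (e.g. a wrapper around a LLaMa2 chat endpoint); it is left abstract here.
    """
    # 1. Embed documents with a (possibly domain-adapted) sentence transformer.
    embedder = SentenceTransformer(embedding_model_name)  # placeholder model
    topic_model = BERTopic(embedding_model=embedder)
    topics, _ = topic_model.fit_transform(docs)

    # 2. For each discovered topic, ask the LLM for a human-readable definition.
    labels = {}
    for topic_id in set(topics):
        if topic_id == -1:  # -1 is BERTopic's outlier topic
            continue
        keywords = [word for word, _ in topic_model.get_topic(topic_id)]
        labels[topic_id] = llm_generate(build_one_shot_prompt(keywords))
    return topic_model, labels
```

The generated labels can then be compared pairwise, for example with generation- or translation-style similarity scores, to flag overlapping topic definitions and outliers, as described above.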