Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions
Classification is a core NLP task with many potential applications. While large language models (LLMs) have brought substantial advancements in text generation, their potential for enhancing classification tasks remains underexplored. To address this gap, we propose a framework for thoroughly investigating the fine-tuning of LLMs for classification, covering both generation- and encoding-based approaches. We instantiate this framework on edit intent classification (EIC), a challenging and underexplored classification task. Our extensive experiments and systematic comparisons across various training approaches and a representative selection of LLMs yield new insights into their application to EIC. To demonstrate the proposed methods and address the shortage of data for empirical edit analysis, we use our best-performing model to create \textit{Re3-Sci2.0}, a new large-scale dataset of 1,780 scientific document revisions with over 94k labeled edits. The new dataset enables an in-depth empirical study of human editing behavior in academic writing. We make our experimental framework, models, and data publicly available.