Adapting Pretrained Language Models for Citation Classification via Self-Supervised Contrastive Learning

Published: 04 Aug 2025 · Last Modified: 04 Feb 2026 · KDD 2025 Research Track (February) · CC BY 4.0
Abstract: Citation classification, which identifies the intention behind academic citations, is pivotal for scholarly analysis. Previous works finetune encoder-based pretrained language models (PLMs) on citation classification datasets, reaping the reward of the linguistic knowledge gained during pretraining. However, direct finetuning for citation classification is hampered by labeled-data scarcity, contextual noise, and spurious keyphrase correlations. In this paper, we present a novel framework that adapts PLMs to overcome these challenges. Our approach introduces self-supervised contrastive learning to alleviate data scarcity and proposes two specialized strategies for constructing contrastive pairs: sentence-level cropping, which sharpens focus on the target citation within long contexts, and keyphrase perturbation, which mitigates reliance on specific keyphrases. Unlike previous works designed only for encoder-based PLMs, our framework is compatible with both encoder-based PLMs and decoder-based LLMs, allowing it to benefit from larger-scale pretraining. Experiments on three benchmark datasets, with both encoder-based PLMs and decoder-based LLMs, demonstrate that our framework outperforms the previous state-of-the-art.
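
To make the two pair-construction strategies concrete, below is a minimal sketch of how sentence-level cropping and keyphrase perturbation could generate contrastive views of a citation context, trained with an NT-Xent-style loss over pooled PLM embeddings. The abstract gives only the high-level idea, so the citation placeholder (`CITATION_MARKER`), the keyphrase list, the masking scheme, the SciBERT backbone, and the loss form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: builds two contrastive views of a citation context
# (sentence-level crop vs. keyphrase-perturbed copy) and computes an NT-Xent loss.
# CITATION_MARKER, KEYPHRASES, and the masking choices are assumed for illustration.
import re
import random
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

CITATION_MARKER = "#AUTHOR_TAG"                      # assumed placeholder for the target citation
KEYPHRASES = ["state-of-the-art", "propose", "outperform"]  # assumed keyphrase list

def crop_to_citation(context: str) -> str:
    """Sentence-level cropping: keep only the sentences containing the target citation."""
    sentences = re.split(r"(?<=[.!?])\s+", context)
    kept = [s for s in sentences if CITATION_MARKER in s]
    return " ".join(kept) if kept else context

def perturb_keyphrases(context: str, mask_token: str = "[MASK]") -> str:
    """Keyphrase perturbation: randomly mask known keyphrases to break spurious cues."""
    out = context
    for kp in KEYPHRASES:
        if kp in out and random.random() < 0.5:
            out = out.replace(kp, mask_token)
    return out

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard NT-Xent contrastive loss over a batch of paired embeddings."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)
    sim = z @ z.T / tau
    sim.fill_diagonal_(float("-inf"))                # exclude self-similarity
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS]-token pooling

contexts = [f"Prior work {CITATION_MARKER} proposes a strong baseline. We outperform it."]
views_a = [crop_to_citation(c) for c in contexts]    # view 1: cropped to the target citation
views_b = [perturb_keyphrases(c) for c in contexts]  # view 2: keyphrases masked

loss = nt_xent(encode(views_a), encode(views_b))
loss.backward()                                      # self-supervised signal, no citation labels used
```

The same two-view construction would apply to a decoder-based LLM by swapping the encoder and pooling strategy (e.g. last-token hidden state); the loss and augmentations are model-agnostic, which is what the compatibility claim in the abstract rests on.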