A Fine-tuned Approach to Code Summarization Generation Based on Keyword Augmentation and Contrastive Learning
Abstract: Code summarization generation aims to automatically produce summaries from an analysis of source code, providing information such as the design goals of code segments and their relevant parameters. Recently, fine-tuning general-purpose code models pre-trained on large-scale code datasets has garnered considerable attention. Several studies have proposed data augmentation methods to mitigate the effect of limited dataset size on fine-tuned models. However, these approaches still face challenges, such as low-quality augmented data and the limited ability of models to capture code representations. To address these issues, we propose CATS, a fine-tuning method for code summarization generation that incorporates keyword augmentation and contrastive learning. CATS employs a two-stage training strategy. In the first stage, a data augmentation method that retains keyword information constructs similar code pairs, and a contrastive learning method then captures code representations from these pairs. In the second stage, a generic autoregressive code summarization task further refines the high-quality code representations obtained in the first stage. We evaluate CATS with three pre-trained models: CodeBERT, GraphCodeBERT, and UniXcoder. Experimental results on two general-purpose datasets show that CATS, when applied to UniXcoder, significantly improves performance on all three evaluation metrics compared with the baseline methods. Moreover, a series of ablation experiments validates the contributions of data augmentation, contrastive learning, and the two-stage training strategy.
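The first training stage described above (keyword-preserving augmentation followed by contrastive learning on the resulting code pairs) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration, not the CATS implementation: the abstract does not say how keywords are chosen, how code is perturbed, or which contrastive objective is used. Here `KEYWORDS` is a hypothetical fixed set of reserved words, `augment_keep_keywords` masks non-keyword identifiers at random, and `info_nce` is a generic in-batch InfoNCE-style loss over toy embedding vectors.

```python
import math
import random
import re

# Assumption: "keywords" are a small fixed set of reserved words.
# The paper may select keywords differently.
KEYWORDS = {"def", "return", "if", "else", "for", "while", "in", "import", "class"}

# Identifiers as whole tokens; every other non-space character as its own token.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\S")


def augment_keep_keywords(code, drop_prob=0.3, seed=0):
    """Return a perturbed view of `code` in which keywords are never altered.

    Non-keyword identifiers are replaced by "<mask>" with probability
    `drop_prob` (a hypothetical perturbation chosen for this sketch).
    """
    rng = random.Random(seed)
    out = []
    for tok in TOKEN_RE.findall(code):
        is_ident = tok[0].isalpha() or tok[0] == "_"
        if is_ident and tok not in KEYWORDS and rng.random() < drop_prob:
            out.append("<mask>")
        else:
            out.append(tok)
    return " ".join(out)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean in-batch InfoNCE-style loss.

    positives[i] is the augmented view of anchors[i]; all other entries of
    `positives` act as negatives for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


if __name__ == "__main__":
    snippet = "def add(a, b): return a + b"
    print(augment_keep_keywords(snippet, drop_prob=0.5, seed=1))
    print(info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]))
```

In a real pipeline the toy vectors would be encoder outputs (for example from UniXcoder) for the original and augmented snippets, and the second stage would continue training the same model on the summarization objective.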
External IDs: dblp:conf/ijcnn/WangJWZR25