CLCoSum: Curriculum Learning-Based Code Summarization for Code Language Models

Hongkui He, Jiexin Wang, Liuwen Cao, Yi Cai

Published: 2025 · Last Modified: 06 Jan 2026 · ICPC 2025 · CC BY-SA 4.0
Abstract: The code summarization task aims to automatically generate natural language descriptions for code snippets. Recently, pre-trained code language models (CLMs) have demonstrated outstanding performance on code summarization. Researchers have also shown that there is a strong correlation between code function names and summaries, and that poorly defined function names lead to worse model-generated summaries. To mitigate this issue, we propose CLCoSum, a curriculum learning-based code summarization method that improves the performance of CLMs on code with poorly named functions. CLCoSum helps CLMs avoid over-reliance on function names when they are poorly defined. First, CLCoSum applies data augmentation operators to function names to generate semantically equivalent, poorly named code; these harder samples help reduce the model's reliance on unclear function names. CLCoSum then uses a curriculum learning paradigm so that the model learns these harder samples in an organized way during fine-tuning, progressing from easier to more difficult training data, similar to the human learning process. Extensive experiments on two existing Java and Python datasets demonstrate that CLCoSum boosts the performance of various CLMs on code summarization, with BLEU-4 improvements ranging from approximately 5% to 20.8%. The fine-tuning speed of CLCoSum on the augmented dataset is also competitive. Our code and data are available at https://github.com/KuiH/CLCoSum.
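The abstract describes two components: an augmentation step that rewrites function names into uninformative ones, and an easy-to-hard ordering of the resulting data for fine-tuning. The sketch below illustrates both ideas in Python under stated assumptions; the `obscure_function_name` operator and the easy/hard ordering heuristic are hypothetical illustrations, not the paper's exact operators or scheduler.

```python
import random
import re

# Illustrative sketch only: a simple "poor naming" augmentation operator and a
# naive easy-to-hard curriculum ordering, assuming Python-style function
# definitions and a binary easy/hard difficulty split.

def obscure_function_name(code: str, placeholder: str = "func") -> str:
    """Replace the defined function's name with an uninformative placeholder,
    yielding a semantically equivalent but 'harder' training sample."""
    match = re.search(r"def\s+(\w+)\s*\(", code)
    if match is None:
        return code
    original_name = match.group(1)
    # Rename the definition and any self-references within the snippet.
    return re.sub(rf"\b{re.escape(original_name)}\b", placeholder, code)


def build_curriculum(samples: list[dict]) -> list[dict]:
    """Order samples from easier (original names) to harder (obscured names)."""
    easy = [s for s in samples if not s["augmented"]]
    hard = [s for s in samples if s["augmented"]]
    random.shuffle(easy)
    random.shuffle(hard)
    return easy + hard


if __name__ == "__main__":
    code = "def binary_search(items, target):\n    ...\n"
    samples = [
        {"code": code, "summary": "Search a sorted list.", "augmented": False},
        {"code": obscure_function_name(code), "summary": "Search a sorted list.", "augmented": True},
    ]
    for sample in build_curriculum(samples):
        print(sample["code"].splitlines()[0])
```

In this toy setup, the fine-tuning loop would iterate over the curriculum order, seeing well-named snippets before their obscured counterparts; the actual difficulty measure and scheduling used by CLCoSum may differ.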