Code Comment Classification with Data Augmentation and Transformer-Based Models

Published: 01 Jan 2025, Last Modified: 19 Aug 2025 · NLBSE@ICSE 2025 · CC BY-SA 4.0
Abstract: Effective classification of code comment sentences into meaningful categories is critical for software comprehension and maintenance. In this work, we present a solution for the NLBSE'25 Code Comment Classification Tool Competition, achieving a 6.7% improvement in accuracy over the baseline STACC models. Our solution employs a multi-step methodology, beginning with translation-retranslation techniques to generate synthetic datasets: by translating the original dataset into multiple languages and back into English, we introduce linguistic diversity that enriches the training data and improves model generalization. We fine-tuned transformer-based architectures, including BERT, CodeBERT, RoBERTa, and DistilBERT, on this augmented dataset. After extensive evaluation, the best-performing model was selected for a robust multi-label classification framework tailored to the Java, Python, and Pharo datasets. The framework is designed to address the unique challenges of each programming language, ensuring high precision, recall, and F1 scores across all 19 categories. The source code is publicly available at https://github.com/Musfiqur6087/NLBSE-25, and the trained model can be accessed at https://huggingface.co/MushfiqurRR.
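The translation-retranslation step described above can be sketched as follows. This is a minimal illustration of the round-trip augmentation pipeline only: the `translate` helper and its lookup tables are hypothetical stand-ins for a real machine-translation system (the paper does not specify which MT models or pivot languages were used), and the function and variable names are our own.

```python
# Hedged sketch of translation-retranslation (back-translation) augmentation.
# A real pipeline would call an MT model for each direction; here a toy
# word-level lookup table stands in, so only the pipeline shape is accurate.

def translate(sentence: str, table: dict) -> str:
    # Toy "translation": map each word through a lookup table,
    # passing unknown words through unchanged.
    return " ".join(table.get(word, word) for word in sentence.split())

# Hypothetical English <-> pivot-language tables for illustration only.
EN_TO_PIVOT = {"returns": "renvoie", "the": "le", "result": "resultat"}
PIVOT_TO_EN = {"renvoie": "gives back", "le": "the", "resultat": "result"}

def back_translate(sentence: str) -> str:
    # English -> pivot language -> English; the imperfect round trip
    # yields paraphrase-like lexical variants of the original comment.
    return translate(translate(sentence, EN_TO_PIVOT), PIVOT_TO_EN)

def augment(comments: list[str]) -> list[str]:
    # Keep every original comment and append each round-trip
    # paraphrase that actually differs from its source.
    out = list(comments)
    for comment in comments:
        paraphrase = back_translate(comment)
        if paraphrase != comment:
            out.append(paraphrase)
    return out
```

In the real setting each paraphrase keeps the labels of its source sentence, so the augmented set enlarges every category without new annotation effort.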