How Robust Are Code Summarization Models to Poor-Readability Code? Fine-grained Evaluation and Benchmark
Abstract: Pre-trained language models such as CodeT5 have achieved substantial success in code comprehension. Despite rapid progress in model architectures and training procedures, we find that the benchmarks used to evaluate code summarization are confined to high-readability code, even though poor-readability code is common in practice. As such, they are inadequate for assessing the fine-grained abilities of models, particularly their robustness to varying degrees of readability. In this paper, we introduce OR-CodeSum, a robustness evaluation benchmark for code summarization comprising seven obfuscated datasets derived from existing datasets. OR-CodeSum incorporates the construction rules of code obfuscation into the testing process, covering the semantic, syntactic, and cross-obfuscation robustness of code summarization models. Our robustness evaluation reveals that current code summarization models rely heavily on code readability while paying insufficient attention to syntactic information. We believe OR-CodeSum can help researchers obtain a more comprehensive and deeper understanding of code summarization models, thereby facilitating improvements in model performance.
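To illustrate the kind of transformation such a benchmark relies on, below is a minimal sketch (our own illustration, not the paper's released tooling) of one semantic obfuscation rule: renaming identifiers to opaque tokens, which preserves program behavior while degrading readability. The RenameIdentifiers class and the v0/v1 naming scheme are hypothetical choices for this sketch.

import ast

class RenameIdentifiers(ast.NodeTransformer):
    # Semantic obfuscation: map every argument and local variable
    # name to an opaque token (v0, v1, ...) without changing behavior.
    def __init__(self):
        self.mapping = {}

    def _opaque(self, name):
        if name not in self.mapping:
            self.mapping[name] = "v%d" % len(self.mapping)
        return self.mapping[name]

    def visit_arg(self, node):
        node.arg = self._opaque(node.arg)
        return node

    def visit_Name(self, node):
        # Rename names being assigned, plus any name already mapped;
        # unmapped loads (builtins such as sum/len) are left intact.
        # A full obfuscator would also rename function names.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._opaque(node.id)
        return node

source = (
    "def average(numbers):\n"
    "    total = sum(numbers)\n"
    "    return total / len(numbers)\n"
)
tree = RenameIdentifiers().visit(ast.parse(source))
print(ast.unparse(tree))  # ast.unparse requires Python >= 3.9
# def average(v0):
#     v1 = sum(v0)
#     return v1 / len(v0)

A model that summarizes the original but fails on the renamed version is leaning on identifier readability rather than on the code's syntax and semantics, which is exactly the gap the benchmark is designed to expose.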
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Python, Java, Go
Preprint Status: There is no non-anonymous preprint and we do not intend to release one.
A1: yes
A1 Elaboration For Yes Or No: 8
A2: yes
A2 Elaboration For Yes Or No: 6
A3: yes
A3 Elaboration For Yes Or No: 1
B: yes
B1: yes
B1 Elaboration For Yes Or No: 3.1
B2: yes
B2 Elaboration For Yes Or No: 3.1
B3: yes
B3 Elaboration For Yes Or No: 3.1
B4: no
B4 Elaboration For Yes Or No: All datasets used are open-source.
B5: no
B5 Elaboration For Yes Or No: The objects of study are programming languages.
B6: yes
B6 Elaboration For Yes Or No: Appendix A
C: yes
C1: yes
C1 Elaboration For Yes Or No: Appendix A
C2: no
C2 Elaboration For Yes Or No: We use the default parameters.
C3: yes
C3 Elaboration For Yes Or No: 5
C4: yes
C4 Elaboration For Yes Or No: Section 4.4 (Evaluation Metrics); Appendix C (Full Results and Results with Other Metrics)
D: no
D1: n/a
D1 Elaboration For Yes Or No: There are no human annotators.
D2: n/a
D2 Elaboration For Yes Or No: There are no human annotators.
D3: n/a
D3 Elaboration For Yes Or No: There are no human annotators.
D4: n/a
D4 Elaboration For Yes Or No: There are no human annotators.
D5: n/a
D5 Elaboration For Yes Or No: There are no human annotators.
E: no
E1: n/a
E1 Elaboration For Yes Or No: AI assistants were not used in this research.