How Robust Are Code Summarization Models to Poor-Readability Code? Fine-grained Evaluation and Benchmark
Abstract: Pre-trained language models such as CodeT5 have achieved substantial success in code comprehension. Despite rapid progress in model architectures and training procedures, we find that the benchmarks used to evaluate code summarization are confined to high-readability code, even though poor-readability code is common in practice. As such, they are inadequate for assessing the fine-grained abilities of models, particularly their robustness to varying degrees of readability. In this paper, we introduce OR-CodeSum, a robustness evaluation benchmark for code summarization comprising seven obfuscated datasets derived from existing datasets. OR-CodeSum incorporates the construction rules of code obfuscation into the testing process, covering the semantic, syntactic, and cross-obfuscation robustness of code summarization models. Our robustness evaluation reveals that current code summarization models rely heavily on code readability while paying insufficient attention to syntactic information. We believe OR-CodeSum can help researchers obtain a more comprehensive and deeper understanding of code summarization models, thereby facilitating improvements in model performance.
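To illustrate the kind of transformation such a benchmark relies on, below is a minimal sketch (our own illustration, not the paper's released tooling) of one semantic obfuscation rule: renaming identifiers to opaque tokens, which preserves program behavior while degrading readability. The RenameIdentifiers class and the v0/v1 naming scheme are hypothetical choices for this sketch.

import ast

class RenameIdentifiers(ast.NodeTransformer):
    # Semantic obfuscation: map every argument and local variable
    # name to an opaque token (v0, v1, ...) without changing behavior.
    def __init__(self):
        self.mapping = {}

    def _opaque(self, name):
        if name not in self.mapping:
            self.mapping[name] = "v%d" % len(self.mapping)
        return self.mapping[name]

    def visit_arg(self, node):
        node.arg = self._opaque(node.arg)
        return node

    def visit_Name(self, node):
        # Rename names being assigned, plus any name already mapped;
        # unmapped loads (builtins such as sum/len) are left intact.
        # A full obfuscator would also rename function names.
        if isinstance(node.ctx, ast.Store) or node.id in self.mapping:
            node.id = self._opaque(node.id)
        return node

source = (
    "def average(numbers):\n"
    "    total = sum(numbers)\n"
    "    return total / len(numbers)\n"
)
tree = RenameIdentifiers().visit(ast.parse(source))
print(ast.unparse(tree))  # ast.unparse requires Python >= 3.9
# def average(v0):
#     v1 = sum(v0)
#     return v1 / len(v0)

A model that summarizes the original but fails on the renamed version is leaning on identifier readability rather than on the code's syntax and semantics, which is exactly the gap the benchmark is designed to expose.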
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Python, Java, Go
Preprint Status: There is no non-anonymous preprint and we do not intend to release one.
A1: yes
A1 Elaboration For Yes Or No: 8
A2: yes
A2 Elaboration For Yes Or No: 6
A3: yes
A3 Elaboration For Yes Or No: 1
B: yes
B1: yes
B1 Elaboration For Yes Or No: 3.1
B2: yes
B2 Elaboration For Yes Or No: 3.1
B3: yes
B3 Elaboration For Yes Or No: 3.1
B4: no
B4 Elaboration For Yes Or No: All datasets used are open-source.
B5: no
B5 Elaboration For Yes Or No: The objects of study are programming languages.
B6: yes
B6 Elaboration For Yes Or No: Appendix A
C: yes
C1: yes
C1 Elaboration For Yes Or No: Appendix A
C2: no
C2 Elaboration For Yes Or No: We use the default parameters.
C3: yes
C3 Elaboration For Yes Or No: 5
C4: yes
C4 Elaboration For Yes Or No: Section 4.4 (Evaluation Metrics); Appendix C (Full Results and Results with Other Metrics)
D: no
D1: n/a
D1 Elaboration For Yes Or No: There are no human annotators.
D2: n/a
D2 Elaboration For Yes Or No: There are no human annotators.
D3: n/a
D3 Elaboration For Yes Or No: There are no human annotators.
D4: n/a
D4 Elaboration For Yes Or No: There are no human annotators.
D5: n/a
D5 Elaboration For Yes Or No: There are no human annotators.
E: no
E1: n/a
E1 Elaboration For Yes Or No: AI assistants were not used in this research.