Abstract: Recent developments in large language models (LLMs) for code, such as Codex and CodeT5+, demonstrate tremendous promise in achieving code intelligence.
Their ability to synthesize code that completes a program for a pre-defined task has been intensively tested and verified on benchmark datasets including HumanEval and MBPP.
Yet, given their broad range of possible applications, evaluating these LLMs from perspectives beyond program synthesis is also anticipated.
In this paper, we explore their program testing ability.
Analyzing the task of automatic test case generation, we show intriguing properties of these models and demonstrate how the quality of their generated test cases can be improved.
Following recent work that uses generated test cases to enhance program synthesis, we further leverage our findings to improve the quality of the synthesized programs, showing +11.77\% and +4.22\% higher code pass rates on HumanEval+ compared with the GPT-3.5-turbo baseline and the recent state-of-the-art, respectively.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: code generation and understanding
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 167