Keywords: Large Language Model, Code Generation, Fuzzing, Software Engineering, Optimization, Search
Abstract: Ensuring the quality of deep learning (DL) libraries is crucial, as bugs can have significant consequences for downstream software. Fuzzing, a powerful testing method, generates random programs to test software. Generally, effective fuzzing requires generated programs to meet three key criteria: rarity, validity, and variety, among which rarity is the most critical for bug detection, as it determines the algorithm's ability to detect bugs. However, current large language model (LLM) based fuzzing approaches struggle to explore the program generation space effectively, resulting in insufficient rarity, and their lack of post-processing leads to a large number of invalid programs and inadequate validity. This paper proposes EvAFuzz, a novel approach that combines Evolutionary Algorithms with LLMs to Fuzz DL libraries. For rarity, EvAFuzz uses a search algorithm to guide LLMs in efficiently exploring the program generation space, iteratively generating increasingly rare programs. For validity, EvAFuzz incorporates a feedback scheme, enabling LLMs to correct invalid programs and achieve high validity. For variety, EvAFuzz constructs a large parent selection space, enriching the diversity of selected parents and thereby enhancing the variety of generated programs. Our experiments show that EvAFuzz outperforms the previous state-of-the-art (SOTA) on several key metrics. First, on the same version of PyTorch, EvAFuzz detects nine unique crashes, surpassing the SOTA's seven. Next, our method achieves a valid rate of 38.80%, significantly higher than the SOTA's 27.69%. Last, EvAFuzz achieves API coverage rates of 99.49% on PyTorch and 85.76% on TensorFlow, outperforming the SOTA's rates of 86.44% on PyTorch and 69.63% on TensorFlow. These results indicate that our method generates programs with higher rarity, validity, and variety, respectively.
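The loop the abstract describes (search-guided generation, feedback-based repair of invalid programs, and parent selection from a large pool) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `mutate_program` and `repair_program` stand in for the LLM generation and feedback-correction steps, `is_valid` stands in for executing a candidate against the DL library, and `rarity` is a placeholder fitness function.

```python
import random

def mutate_program(parent: str) -> str:
    # Hypothetical stand-in for LLM-based generation from a selected parent.
    return parent + f"+op{random.randint(0, 9)}"

def is_valid(program: str) -> bool:
    # Hypothetical stand-in for running the program against the DL library.
    return not program.endswith("op0")

def repair_program(program: str) -> str:
    # Hypothetical stand-in for the feedback scheme: the LLM receives the
    # error and corrects the invalid program.
    return program[: -len("op0")] + "op1"

def rarity(program: str) -> int:
    # Placeholder fitness: rarer programs score higher (here, longer ones).
    return len(program)

def evolve(seed: str, generations: int = 20, pool_size: int = 8) -> list:
    population = [seed]
    for _ in range(generations):
        # Large parent selection space: any member of the pool may be chosen.
        parent = random.choice(population)
        child = mutate_program(parent)
        if not is_valid(child):
            child = repair_program(child)  # feedback-driven correction
        population.append(child)
        # Keep the rarest programs to steer exploration toward rarity.
        population = sorted(population, key=rarity, reverse=True)[:pool_size]
    return population

pool = evolve("seed_program")
```

In this sketch the selection pressure comes only from the placeholder `rarity` score; the actual fitness signal and selection strategy used by EvAFuzz are described in the paper itself.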
Primary Area: other topics in machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9407