MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios

ACL ARR 2025 February Submission4151 Authors

15 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0

Abstract: Code debugging is a crucial task in software engineering that has attracted increasing attention. While remarkable progress has been made in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenarios found in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries and covers a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation on MLDebugging using both mainstream open-source and closed-source LLMs and find that current LLMs still struggle to debug code correctly in multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenarios and offer insights for future research.

Paper Type: Long

Research Area: NLP Applications

Research Area Keywords: Multi-Library Code Debugging, Code Debugging, Automated Program Repair

Contribution Types: Data resources, Data analysis

Languages Studied: English

Submission Number: 4151
