Keywords: Large Language Models, Automated Program Repair, Benchmark, Robustness Testing
TL;DR: A benchmark for testing how robustly Large Language Models repair bugs in code
Abstract: Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks.
However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly
observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness
benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,350 transformed
instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several
transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
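To illustrate what a semantics-preserving transformation of a buggy program can look like, here is a minimal sketch in Python. The abstract does not list the eight transformations, so identifier renaming is assumed here only as a typical example; the Java snippet, the `rename_identifier` helper, and the names `acc` and `runningTotal` are all hypothetical and are not taken from HEJ-Robust.

```python
import re

# Hypothetical buggy Java method (not from HumanEval-Java-Bug).
# Intended behavior: sum the integers 0..n inclusive.
BUGGY_JAVA = """\
public static int sumTo(int n) {
    int acc = 0;
    for (int i = 0; i < n; i++) {  // bug: should be i <= n
        acc += i;
    }
    return acc;
}
"""


def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename whole-word occurrences of `old` to `new`.

    Naive by design: it does not parse Java, so it would also rewrite
    matches inside string literals or comments. A real tool would
    operate on the syntax tree instead.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


# The variant behaves exactly like the original and still contains the bug;
# only the surface form seen by the repair model has changed.
variant = rename_identifier(BUGGY_JAVA, "acc", "runningTotal")
print(variant)
```

A repair model that is robust in the sense described above should fix the loop bound in both `BUGGY_JAVA` and `variant`, since the two differ only in a local variable name.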
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Short papers (i.e., vision, new ideas, and position papers). 2–4 pages
Reroute: true
Submission Number: 52