HEJ-Robust: A Robustness Benchmark for LLM-based Automated Program Repair

Fazle Rabbi; Jinqiu Yang

Published: 14 May 2026, Last Modified: 14 May 2026. Venue: AIWare 2026 Benchmark and Dataset. Readers: Everyone. License: CC BY 4.0

Keywords: Large Language Models, Automated Program Repair, Benchmark, Robustness Testing

TL;DR: HEJ-Robust is a robustness benchmark showing that LLM-based automated program repair models degrade substantially under minor, semantics-preserving code transformations.

Abstract: Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of each buggy program and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug by applying eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
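To make the idea of a semantics-preserving transformation concrete, here is a minimal Java sketch. The abstract does not name the eight transformations, so the loop rewrite and variable renaming shown here are assumptions chosen for illustration, and the class, methods, and bug are hypothetical rather than taken from HumanEval-Java-Bug. The point is only that the transformed method keeps the same bug and the same input/output behavior while presenting different syntax to a repair model.

```java
// Illustrative sketch only: the source does not list the eight transformations,
// so this for-to-while rewrite with renamed locals is an assumed example.
class TransformationSketch {

    // Hypothetical buggy method: the loop starts at index 1,
    // so the first element is never added to the sum.
    static int sumOriginal(int[] values) {
        int total = 0;
        for (int i = 1; i < values.length; i++) {
            total += values[i];
        }
        return total;
    }

    // Transformed variant: the for-loop becomes a while-loop and the
    // locals are renamed. The bug and the behavior are unchanged.
    static int sumTransformed(int[] values) {
        int acc = 0;
        int idx = 1;
        while (idx < values.length) {
            acc += values[idx];
            idx++;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] input = {4, 5, 6};
        // Both variants return the same (incorrect) result for the same input.
        System.out.println(sumOriginal(input) == sumTransformed(input));
    }
}
```

A robust repair model would be expected to fix both variants equally well, since the required patch (starting the index at 0) is the same in each.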

Submission Number: 11
