Abstract: Automated program repair has been shown to be susceptible to generating repaired code that passes the seen tests but fails on a hold-out set of hidden tests. This problem, dubbed test overfitting, was identified and studied before the rise of large language models. We experimentally study the extent to which test overfitting remains a problem today, using repository-level SWE-bench tasks.
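To make the overfitting criterion concrete, here is a minimal sketch of how a patch is typically judged in APR evaluation: a patch is test-overfitting if it passes every seen (developer-visible) test but fails at least one held-out test. All names below (`is_overfitting`, `patched_square`, the toy tests) are hypothetical illustrations, not the paper's methodology or the SWE-bench harness.

```python
# Sketch of the test-overfitting criterion (hypothetical names, not from the paper):
# a patch overfits if it passes all seen tests but fails some hidden test.
from typing import Callable, Iterable

Patch = Callable[[int], int]
Test = Callable[[Patch], None]  # a test raises AssertionError on failure


def check(cond: bool) -> None:
    assert cond


def passes_all(patch: Patch, tests: Iterable[Test]) -> bool:
    """Run each test against the patch; the patch passes if no test raises."""
    for test in tests:
        try:
            test(patch)
        except AssertionError:
            return False
    return True


def is_overfitting(patch: Patch, seen: Iterable[Test], hidden: Iterable[Test]) -> bool:
    """Plausible on the seen tests, but broken on the held-out ones."""
    return passes_all(patch, seen) and not passes_all(patch, hidden)


# A "plausible" repair of square() that hard-codes the seen test's expectation.
def patched_square(x: int) -> int:
    return 4  # satisfies f(2) == 4 but nothing else


seen_tests = [lambda f: check(f(2) == 4)]
hidden_tests = [lambda f: check(f(3) == 9)]  # the hold-out test reveals the overfit

print(is_overfitting(patched_square, seen_tests, hidden_tests))  # True
```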