Keywords: Automated Program Repair, Adaptive Benchmarks, Evaluation Methodology, Large Language Models, Benchmark Generation
TL;DR: Static benchmarks for program repair are structurally broken and cannot be fixed by making them bigger — we propose replacing them with adaptive, specification-driven pipelines that generate infinite, verified repair instances on demand.
Abstract: Automated Program Repair (APR) has rapidly advanced with the emergence of Large Language Models (LLMs), and modern repair systems increasingly achieve high success rates on established benchmarks — raising concerns about evaluation saturation and distributional overfitting. This paper argues that the dominant paradigm of static benchmark evaluation is structurally inadequate, and that scaling static datasets cannot resolve this inadequacy. We propose a paradigm shift toward adaptive, specification-driven benchmark generation governed by five organizing principles: generative unboundedness, specification primacy, deterministic certification, adaptive coverage, and oracle independence. We formalize these principles, develop a taxonomy of the dimensions along which repair instances vary, and argue that the generator–verifier separation is the architectural consequence that makes the framework trustworthy. We treat the oracle problem as a conceptual issue in its own right, examine the tradeoffs among available correctness criteria, and identify the open problems the framework surfaces but does not resolve.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length paper (i.e., case studies, theoretical, or applied research papers). 8 pages
Reroute: false
Submission Number: 55