Keywords: LLM, agents, self-improvement
Abstract: Coding agents can recursively modify their own implementations, forming a self-improvement loop. While prior work shows this can boost performance on coding benchmarks, existing approaches are costly and compute-intensive. We introduce a simple, sample-efficient self-improvement framework that significantly improves coding performance under strict budget constraints. We identify the evaluation of candidate self-modifications as the main bottleneck: prior approaches estimate a modification's effectiveness by re-running a subset of benchmark tasks with the modified agent. We replace these partial benchmark runs with an LLM-as-a-judge signal to rank candidate patches, reserving expensive evaluations for only the most promising ones. Combined with a lightweight tree search, this enables effective exploration with minimal overhead. On a subset of SWE-bench Verified, our method improves the pass rate of gpt-5-mini by over 11\% in fewer than three self-improvement steps, using just \$25 in API costs within 15 CPU hours.
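A minimal sketch of the judge-based ranking step the abstract describes: score each candidate patch with an LLM judge and pass only the top few on to expensive benchmark evaluation. The prompt wording, scoring scale, judge model choice, and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: rank candidate self-modification patches with an
# LLM judge, then fully evaluate only the top-k. Prompt, 1-10 scale, and
# model choice are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are reviewing a patch that modifies a coding agent's own source.\n"
    "Rate it from 1 (likely harmful) to 10 (very likely to improve the\n"
    "agent's benchmark pass rate). Reply with the number only.\n\n"
    "Patch:\n{patch}"
)

def judge_score(patch: str, model: str = "gpt-5-mini") -> float:
    """Ask the judge model for a scalar quality score for one candidate patch."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(patch=patch)}],
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat an unparseable reply as a rejection

def select_candidates(patches: list[str], k: int = 3) -> list[str]:
    """Keep only the k highest-scoring patches for expensive benchmark runs."""
    return sorted(patches, key=judge_score, reverse=True)[:k]
```

The cheap judge call replaces a partial benchmark run per candidate, so the costly evaluation budget is spent only on the patches most likely to help.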
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 126