ACE: A Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization

Published: 05 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop RSI Spotlight · CC BY 4.0
Keywords: Large Language Models, Code Generation, Self-Evolving Systems, Adversarial Unit Testing, Execution-Based Supervision, Preference Optimization, Solver–Adversary Framework
TL;DR: We propose ACE, a framework that trains code-generating LLMs to self-improve using adversarial unit tests and execution outcomes alone, without relying on ground-truth code, expected outputs, or semantic verifiers.
Abstract: Large Language Models (LLMs) excel at code generation but remain heavily reliant on large-scale annotated solutions and verification-based supervision, which constrains scalability and hinders sustained self-improvement. Recent solver-verifier frameworks exploit program execution as an automatic supervision signal, but their effectiveness degrades as solvers become moderately strong: verifier-generated tests increasingly confirm semantic correctness rather than exposing the remaining failure modes. We propose \textbf{ACE}, a self-evolving code generation framework based on a solver–adversary architecture that prioritizes active failure discovery through execution-centric supervision. A single LLM alternates between generating candidate programs and producing adversarial unit test inputs optimized to induce execution-level failures, such as runtime errors, exceptions, or non-termination. Supervision is derived solely from execution outcomes: robust programs are selected for supervised fine-tuning, while adversarial tests are optimized via Kahneman–Tversky Optimization using execution-derived preferences. Notably, the entire training loop requires no ground-truth code, expected outputs, or external reward models. Experiments on CodeContests, MBPP, and LiveCodeBench demonstrate that ACE consistently outperforms strong solver-verifier baselines, achieving 3–7\% absolute gains in pass@1, with larger improvements on out-of-distribution benchmarks, while maintaining competitive or improved inference efficiency.
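Since the abstract describes the solver–adversary loop only in prose, the following is a minimal sketch of how one such round might be wired up. It assumes candidate programs expose a `solve(input_str)` entry point; the function names (`solver_generate`, `adversary_generate`, `ace_round`), the timeout value, and the data formats are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of one ACE-style solver–adversary round.
# Supervision comes only from execution outcomes: "ok", "error", or "timeout".
import multiprocessing as mp


def _run(program_src: str, test_input: str, queue: mp.Queue) -> None:
    """Execute a candidate program on one adversarial input in a child process."""
    try:
        scope: dict = {}
        exec(program_src, scope)          # candidate is assumed to define solve()
        scope["solve"](test_input)        # run it on the adversarial test input
        queue.put("ok")
    except Exception:
        queue.put("error")                # runtime error / exception


def execute(program_src: str, test_input: str, timeout_s: float = 2.0) -> str:
    """Return 'ok', 'error', or 'timeout'; non-termination is caught via the timeout."""
    queue: mp.Queue = mp.Queue()
    proc = mp.Process(target=_run, args=(program_src, test_input, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():                   # program did not terminate in time
        proc.terminate()
        proc.join()
        return "timeout"
    return queue.get() if not queue.empty() else "error"


def ace_round(problem: str, solver_generate, adversary_generate,
              n_programs: int = 4, n_tests: int = 8):
    """One self-evolution round: collect SFT examples and KTO-style labeled tests."""
    # The same LLM plays both roles; here they are passed in as callables.
    programs = [solver_generate(problem) for _ in range(n_programs)]
    tests = [adversary_generate(problem, programs) for _ in range(n_tests)]

    # Execute every (program, test) pair once and cache the outcome.
    outcome = {(i, j): execute(p, t)
               for i, p in enumerate(programs)
               for j, t in enumerate(tests)}

    # Programs that survive all adversarial tests become SFT targets.
    sft_examples = [{"prompt": problem, "completion": p}
                    for i, p in enumerate(programs)
                    if all(outcome[(i, j)] == "ok" for j in range(len(tests)))]

    # Tests that expose at least one failure get a positive (desirable) KTO label.
    kto_examples = [{"prompt": problem, "completion": t,
                     "desirable": any(outcome[(i, j)] != "ok"
                                      for i in range(len(programs)))}
                    for j, t in enumerate(tests)]
    return sft_examples, kto_examples
```

In the full framework, `sft_examples` would feed supervised fine-tuning of the solver role and `kto_examples` would feed a Kahneman–Tversky Optimization objective on the adversary role, with both roles served by a single shared model; note that KTO consumes binary desirable/undesirable labels rather than explicit preference pairs, which is why each test is labeled independently here.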
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 74