ATGen: Adversarial Reinforcement Learning for Test Case Generation

ICLR 2026 Conference Submission 18651 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Test Case Generation, Reinforcement Learning, Large Language Models, Code Generation
Abstract: Large Language Models (LLMs) show remarkable code generation capabilities but often produce imperfect code with subtle bugs. A critical bottleneck for improving code quality is the scarcity of high-quality test cases. Existing approaches, primarily based on Supervised Fine-Tuning (SFT) over static datasets, are limited in their ability to discover novel bugs and struggle with the fundamental trade-off between generating error-triggering inputs and maintaining correct expected outputs. To address these limitations, we reframe test case generation as an iterative, adversarial process. We introduce ATGEN (Adversarial Test Generator), a novel framework that trains a test case generator via Reinforcement Learning (RL) in an adversarial loop with an evolving code generator. Instead of learning from a fixed set of bugs, our test generator is dynamically trained to create "attacking" I/O pairs for buggy code that is itself being iteratively generated. This process is guided by a reward function that explicitly balances the dual objectives of maximizing the bug detection rate and maintaining high output accuracy. Extensive experiments show that ATGEN dramatically outperforms the state-of-the-art SFT-based approach, UTGen, improving IO Accuracy by nearly 40 absolute points (71.56% vs. 31.83%) and more than doubling the Attack Rate (34.02% vs. 16.24%). The adversarial curriculum is particularly effective for hard-to-detect bugs, achieving an attack rate more than double that of the strongest baseline. Furthermore, tests generated by ATGEN serve as a more effective filter in Best-of-N code generation, significantly closing the gap to the human expert upper bound. Our work establishes a new and more effective paradigm for automated test generation and debugging for LLMs.
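To make the dual-objective reward concrete, here is a minimal sketch of what a reward balancing bug detection against output accuracy could look like. The function name, the binary scoring, and the weights `w_attack` and `w_acc` are illustrative assumptions on our part; the abstract does not specify the authors' actual formulation.

```python
# Hypothetical sketch of ATGen-style reward shaping (not the authors' code).
# A generated test is an (input, predicted expected output) pair, scored
# against a buggy program and a trusted reference program.

def test_reward(test_input, predicted_output,
                buggy_program, reference_program,
                w_attack: float = 0.5, w_acc: float = 0.5) -> float:
    """Reward = weighted sum of bug detection and expected-output accuracy."""
    gold_output = reference_program(test_input)    # ground-truth behavior
    buggy_output = buggy_program(test_input)       # behavior under the bug

    # Attack term: 1 if this input exposes the bug, i.e. the buggy
    # program disagrees with the reference on it; 0 otherwise.
    attack = float(buggy_output != gold_output)

    # Accuracy term: 1 if the generator's predicted expected output
    # matches the reference program's true output; 0 otherwise.
    accuracy = float(predicted_output == gold_output)

    return w_attack * attack + w_acc * accuracy
```

Under this reading, a test earns full reward only when it both triggers divergent behavior in the buggy program and carries a correct expected output, which is exactly the trade-off the abstract says SFT-based generators struggle with.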
Primary Area: generative models
Submission Number: 18651