Keywords: large language models, code generation, controlled generation, attacks, reliability, reinforcement learning
Abstract: We propose BreaC, a new method for attacking large language models (LLMs) so that they generate erroneous code at an excessive rate. BreaC trains a class-conditional language model (CCLM) that conditions code generation on a binary attribute specifying whether the output code should contain errors. The CCLM can not only generate erroneous programs itself but also steer other, much larger LLMs to do so without access to their weights. The CCLM is trained with unlikelihood training and with reinforcement learning that treats its two generation branches as adversaries. We instantiate BreaC on the task of generating code with compilation and parsing errors. Our extensive evaluation demonstrates that BreaC is effective in both adversarial and benign scenarios. In the adversarial scenario, BreaC greatly reduces the compilation rate of various LLMs while maintaining the perplexity of the generated programs. In the benign scenario, BreaC produces realistic erroneous programs from correct ones, enabling the construction of parallel training datasets. We demonstrate the high utility of these datasets by training neural bug fixers that significantly surpass the state of the art.
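The abstract reports the compilation rate of generated programs as its main adversarial metric. The following is only a minimal sketch of how a comparable parse-success rate could be measured. It assumes the generated programs are Python and uses the standard-library `ast.parse` as a stand-in for whatever compilers or parsers the paper actually uses, which the abstract does not specify. The names `parses`, `parse_rate`, and `samples` are hypothetical and chosen for illustration.

```python
import ast


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python (assumed target language)."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers ordinary parse failures; ValueError can be raised
        # for inputs such as source containing null bytes on some versions.
        return False


def parse_rate(programs) -> float:
    """Fraction of programs that parse; defined as 0.0 for an empty collection."""
    programs = list(programs)
    if not programs:
        return 0.0
    return sum(parses(p) for p in programs) / len(programs)


# Hypothetical model outputs: two valid programs and two with parsing errors.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b)\n    return a + b\n",  # missing colon after the signature
    "print('hello'\n",                    # unclosed parenthesis
    "x = [i * 2 for i in range(3)]\n",
]

rate = parse_rate(samples)
print(f"parse rate: {rate:.2f}")  # prints "parse rate: 0.50"
```

Under these assumptions, a successful attack would show up as a lower `parse_rate` on attacked generations than on unattacked ones for the same prompts.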
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (eg, AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)
TL;DR: We present BreaC, a novel method for breaking large language model-based code generators such that they excessively generate erroneous code.