Code-of-Thought Prompting: Probing AI Safety with Code

ICLR 2025 Conference Submission 12289 Authors

27 Sept 2024 (modified: 02 Dec 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: Toxicity analysis, AI Safety, LLMs
Abstract: Large Language Models (LLMs) have rapidly advanced in multiple capabilities, such as text and code understanding, leading to their widespread use in applications including healthcare, education, and search. Due to the critical nature of these applications, there has been a heightened emphasis on aligning these models to human values and preferences to improve safety and reliability. In this paper, we demonstrate that contemporary efforts fall severely short of the ultimate goal of AI safety and fail to ensure safe, non-toxic outputs. We systematically evaluate the safety of LLMs through a novel model interaction paradigm dubbed Code of Thought (CoDoT) prompting, which transforms natural language (NL) prompts into pseudo-code. CoDoT represents NL inputs in a precise, structured, and concise form, allowing us to use its programmatic interface to test several facets of AI safety. Under the CoDoT prompting paradigm, we show that a wide range of large language models emit highly toxic outputs with the potential to cause great harm. CoDoT leads to a staggering 16.5× increase in toxicity on GPT-4 TURBO and a massive 4.6× increase on average across multiple models and languages. Notably, we find that state-of-the-art mixture-of-experts (MoE) models are approximately 3× more susceptible to toxicity than standard architectures. Our findings raise the troubling concern that recent safety and alignment efforts have regressed LLMs and inadvertently introduced safety backdoors and blind spots. Given the rapid adoption of LLMs, our work calls for a rigorous, first-principles evaluation of the design choices underlying current safety efforts.
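The abstract reports toxicity changes as multiplicative factors (e.g., 16.5× on GPT-4 TURBO, 4.6× on average). Below is a minimal sketch of how such a ratio could be computed from per-prompt toxicity scores; the function names, the callable-based model/scorer interface, and the mean-based aggregation are illustrative assumptions, not the authors' actual evaluation pipeline.

    from statistics import mean
    from typing import Callable, Sequence

    def toxicity_increase(
        prompts: Sequence[str],
        respond_nl: Callable[[str], str],     # model queried with the plain NL prompt (assumed interface)
        respond_codot: Callable[[str], str],  # model queried with the CoDoT-style prompt (assumed interface)
        score: Callable[[str], float],        # external toxicity scorer returning a value in [0, 1] (assumed)
    ) -> float:
        """Multiplicative increase in mean toxicity under CoDoT prompting vs. plain NL prompting."""
        nl_tox = mean(score(respond_nl(p)) for p in prompts)
        codot_tox = mean(score(respond_codot(p)) for p in prompts)
        return codot_tox / max(nl_tox, 1e-9)  # guard against division by zero when NL outputs are non-toxic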
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12289