Code-of-Thought Prompting: Probing AI Safety with Code

ICLR 2025 Conference Submission 12289 Authors

27 Sept 2024 (modified: 02 Dec 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: Toxicity analysis, AI Safety, LLMs
Abstract: Large Language Models (LLMs) have rapidly advanced in multiple capabilities, such as text and code understanding, leading to their widespread use in applications including healthcare, education, and search. Due to the critical nature of these applications, there has been a heightened emphasis on aligning these models to human values and preferences to improve safety and reliability. In this paper, we demonstrate that contemporary efforts fall severely short of the ultimate goal of AI safety and fail to ensure safe, non-toxic outputs. We systematically evaluate the safety of LLMs through a novel model interaction paradigm dubbed Code of Thought (CoDoT) prompting, which transforms natural language (NL) prompts into pseudo-code. CoDoT represents NL inputs in a precise, structured, and concise form, allowing us to use its programmatic interface to test several facets of AI safety. Under the CoDoT prompting paradigm, we show that a wide range of large language models emit highly toxic outputs with the potential to cause great harm. CoDoT leads to a staggering 16.5× increase in toxicity on GPT-4 TURBO and a massive 4.6× increase on average across multiple models and languages. Notably, we find that state-of-the-art mixture-of-experts (MoE) models are approximately 3× more susceptible to toxicity than standard architectures. Our findings raise the troubling concern that recent safety and alignment efforts have regressed LLMs and inadvertently introduced safety backdoors and blind spots. Given the rapid adoption of LLMs, our work calls for a rigorous, first-principles evaluation of the design choices underlying current safety efforts.
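The abstract reports toxicity changes as multiplicative factors (e.g., 16.5× on GPT-4 TURBO, 4.6× on average). Below is a minimal sketch of how such a ratio could be computed from per-prompt toxicity scores; the function names, the callable-based model/scorer interface, and the mean-based aggregation are illustrative assumptions, not the authors' actual evaluation pipeline.

    from statistics import mean
    from typing import Callable, Sequence

    def toxicity_increase(
        prompts: Sequence[str],
        respond_nl: Callable[[str], str],     # model queried with the plain NL prompt (assumed interface)
        respond_codot: Callable[[str], str],  # model queried with the CoDoT-style prompt (assumed interface)
        score: Callable[[str], float],        # external toxicity scorer returning a value in [0, 1] (assumed)
    ) -> float:
        """Multiplicative increase in mean toxicity under CoDoT prompting vs. plain NL prompting."""
        nl_tox = mean(score(respond_nl(p)) for p in prompts)
        codot_tox = mean(score(respond_codot(p)) for p in prompts)
        return codot_tox / max(nl_tox, 1e-9)  # guard against division by zero when NL outputs are non-toxic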
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12289