Tokenizer-Agnostic Transferable Attacks on Language Models for Enhanced Red Teaming

26 Sept 2024 (modified: 16 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Adversarial Attacks, Red Teaming, Transferable Attacks, AI Safety, Large Language Models
TL;DR: We introduce TIARA, a tokenizer-independent method for generating transferable adversarial attacks on language models for enhanced red teaming.
Abstract: Large Language Models (LLMs) have become increasingly prevalent, raising concerns about potential vulnerabilities and misuse. Effective red teaming methods are crucial for improving AI safety, yet current approaches often require access to model internals or rely on specific jailbreak techniques. We present TIARA (Tokenizer-Independent Adversarial Red-teaming Approach), a novel method for automated red teaming of LLMs that advances the state of the art in transferable adversarial attacks. Unlike previous token-level methods, TIARA requires neither gradient access nor a fixed tokenizer, enabling simultaneous attacks on multiple models with diverse architectures. By combining teacher-forcing and auto-regressive loss functions with a multi-stage candidate selection procedure, it achieves superior performance without relying on gradient information or dedicated attacker models. TIARA attains an 82.9% attack success rate on GPT-3.5 Turbo and 51.2% on Gemini Pro, surpassing previous transfer and direct attacks on the HarmBench benchmark. We provide insights into the effect of adversarial string length and a qualitative analysis of the discovered adversarial techniques. This work contributes to AI safety by offering a robust, versatile tool for identifying potential vulnerabilities in LLMs, facilitating the development of safer AI systems.
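To make the abstract's objective concrete, the following is a minimal sketch of how a candidate adversarial suffix could be scored across several victim models without a shared tokenizer. This is an illustration, not the authors' released code: it assumes HuggingFace-style causal LMs, and the function names, the greedy-prefix interpretation of the auto-regressive loss, and the weight `alpha` are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_forcing_loss(model, tokenizer, prompt: str, target: str) -> float:
    """NLL of `target` given `prompt`, with the ground-truth target tokens
    fed in at every step (standard teacher forcing)."""
    p = tokenizer(prompt, return_tensors="pt").input_ids
    t = tokenizer(target, add_special_tokens=False, return_tensors="pt").input_ids
    logits = model(torch.cat([p, t], dim=1)).logits
    # Positions p_len-1 .. end-2 predict the target tokens.
    return F.cross_entropy(logits[0, p.shape[1] - 1 : -1], t[0]).item()

@torch.no_grad()
def autoregressive_loss(model, tokenizer, prompt: str, target: str) -> float:
    """One plausible reading of the auto-regressive loss (an assumption here):
    let the model greedily extend the prompt, then measure how likely the
    target still is after its own continuation."""
    p = tokenizer(prompt, return_tensors="pt").input_ids
    gen = model.generate(p, max_new_tokens=16, do_sample=False)
    continued = tokenizer.decode(gen[0], skip_special_tokens=True)
    return teacher_forcing_loss(model, tokenizer, continued, target)

def combined_loss(models, adv_suffix: str, behavior: str, target: str,
                  alpha: float = 0.5) -> float:
    """Weighted sum of both losses, averaged over all victim models.
    Each model applies its own tokenizer to the raw string, so no shared
    vocabulary is required -- the tokenizer-independent property."""
    prompt = f"{behavior} {adv_suffix}"
    total = 0.0
    for model, tok in models:
        total += (alpha * teacher_forcing_loss(model, tok, prompt, target)
                  + (1 - alpha) * autoregressive_loss(model, tok, prompt, target))
    return total / len(models)
```

Because candidates are mutated and scored at the string level, a score like `combined_loss` can rank suffixes against heterogeneous models in one pass; a multi-stage selection procedure as described in the abstract would then keep the lowest-loss candidates for further refinement.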
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8020