Adversarial Reasoning at Jailbreaking Time

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 Poster, CC BY 4.0
TL;DR: We employ a reasoning-driven framework to formulate and solve the jailbreaking problem as an optimization task.
Abstract: As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking that leverages a loss signal to guide the test-time compute, achieving SOTA attack success rates against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
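To make the optimization framing concrete, here is a minimal sketch of a loss-guided, iterative prompt-search loop of the kind the abstract describes. It is an illustration only, not the paper's actual method: the functions loss_fn and propose_refinements are hypothetical placeholders standing in for the target-model loss signal and the attacker-side reasoning modules (see the linked repository for the real implementation).

import random

def loss_fn(prompt: str) -> float:
    """Hypothetical placeholder: lower loss means the target model is closer
    to complying. In practice this would query the target LLM (or a judge)
    and return, e.g., the negative log-likelihood of an affirmative prefix."""
    return random.random()  # stand-in signal for illustration only

def propose_refinements(prompt: str, feedback: float, k: int = 4) -> list[str]:
    """Hypothetical placeholder: an attacker/reasoning model would rewrite the
    prompt in k different ways, conditioned on the current loss feedback."""
    return [f"{prompt} [refinement {i}, loss was {feedback:.2f}]" for i in range(k)]

def loss_guided_search(seed_prompt: str, iterations: int = 15) -> str:
    """Keep the lowest-loss candidate at each step, spending test-time compute
    to drive the loss down over a fixed iteration budget."""
    best_prompt, best_loss = seed_prompt, loss_fn(seed_prompt)
    for _ in range(iterations):
        for candidate in propose_refinements(best_prompt, best_loss):
            cand_loss = loss_fn(candidate)
            if cand_loss < best_loss:
                best_prompt, best_loss = candidate, cand_loss
    return best_prompt

if __name__ == "__main__":
    print(loss_guided_search("seed prompt"))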
Lay Summary: Large language models (LLMs) can still be tricked into producing disallowed or dangerous content, even after extensive safety-training. We show that breaking these defences is far more effective when the attacker treats the task as a reasoning problem rather than a trial-and-error search. Our framework, Adversarial Reasoning, deploys several reasoning modules, each with a specific role, to build reasoning trajectories and explore candidate attacks. In just 15 iterations the method reliably outperforms prior jailbreak techniques across several frontier models. These results reveal that giving adversaries more "thinking time" at test-time can defeat existing safety measures. To support more secure future systems, we have released our code.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Helloworld10011/Adversarial-Reasoning
Primary Area: Social Aspects->Safety
Keywords: LLMs, Jailbreaking, Reasoning, Test-time compute.
Submission Number: 13747