Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Published: 27 Oct 2025, Last Modified: 27 Oct 2025 | NeurIPS Lock-LLM Workshop 2025 Poster | CC BY 4.0
Keywords: LLMs, interpretability, jailbreaking, steering, vulnerabilities
TL;DR: We audit eight open-source LLMs from the Llama3, GPT-oss, Qwen3, and Phi4 model families using interpretability-based jailbreaking via a proposed two-stage steering-coefficient search algorithm, exposing varying safety vulnerabilities and dual-use risks.
Abstract: Effective safety auditing of large language models (LLMs) demands tools that go beyond black-box probing and systematically uncover vulnerabilities rooted in model internals. We present a comprehensive, interpretability-driven jailbreaking audit of eight SOTA open-source LLMs: Llama-3.1-8B, Llama-3.3-70B-4bt, GPT-oss-20B, GPT-oss-120B, Qwen3-0.6B, Qwen3-32B, Phi4-3.8B, and Phi4-14B. Leveraging interpretability-based approaches – Universal Steering (US) and Representation Engineering (RepE) – we introduce an adaptive two-stage grid search algorithm to identify optimal activation-steering coefficients for unsafe behavioral concepts. Our evaluation, conducted on a curated set of harmful queries and a standardized LLM-based judging protocol, reveals stark contrasts in model robustness. The Llama-3 models are highly vulnerable, with up to 91% (US) and 83% (RepE) jailbroken responses on Llama-3.3-70B-4bt, while GPT-oss-120B remains robust to attacks via both interpretability approaches. Qwen and Phi models show mixed results, with the smaller Qwen3-0.6B and Phi4-3.8B mostly exhibiting lower jailbreaking rates, while their larger counterparts are more susceptible. Our results establish interpretability-based steering as a powerful tool for systematic safety audits, but also highlight its dual-use risks and the need for better internal defenses in LLM deployment.
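The submission does not include code on this page, so the following is only a minimal sketch of the kind of coarse-then-fine search over a single activation-steering coefficient that the abstract describes, assuming a standard PyTorch/Transformers setup. The model checkpoint, layer index, steering direction, judge callable, and coefficient grid below are illustrative placeholders, not the authors' actual algorithm or hyperparameters.

```python
# Hypothetical sketch: two-stage (coarse-then-fine) grid search over one
# activation-steering coefficient. All names below (checkpoint, layer index,
# direction vector, judge) are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
LAYER_IDX = 16                                    # assumed steering layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def generate_with_steering(prompt, direction, alpha):
    """Add alpha * direction to one layer's residual stream during generation."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[LAYER_IDX].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=128, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

def two_stage_coefficient_search(prompts, direction, judge, coarse=(2, 4, 8, 16), refine_steps=5):
    """Stage 1: coarse sweep over candidate coefficients.
    Stage 2: finer sweep around the best coarse value.
    `judge(prompt, response)` is a placeholder for an LLM-based scorer in [0, 1]."""
    def score(alpha):
        responses = [generate_with_steering(p, direction, alpha) for p in prompts]
        return sum(judge(p, r) for p, r in zip(prompts, responses)) / len(prompts)

    best_coarse = max(coarse, key=score)
    lo, hi = best_coarse / 2, best_coarse * 2
    fine = [lo + i * (hi - lo) / (refine_steps - 1) for i in range(refine_steps)]
    return max(fine, key=score)
```

The two-stage structure simply trades a wide, cheap sweep for a narrow, higher-resolution one around the best candidate; how the paper extracts the unsafe-behavior direction and scores responses is described in the full text, not here.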
Submission Number: 28