Abstract: As large language models (LLMs) are increasingly deployed in sensitive and specialized applications, ensuring their robustness is vital. A major challenge is their vulnerability to adversarial attacks, which exploit inherent weaknesses and can cause the models to generate unintended or harmful responses. Although considerable effort has been devoted during the post-training phase of LLMs to detecting these vulnerabilities, jailbreak attacks continue to evolve, shifting from prompt-level exploits to model-level and even gradient-level optimization, and now affect both white-box models and complex black-box systems.
This paper proposes a novel gradient-free adversarial attack, named Logits Bias Injection (LBI), designed for open-weight LLMs.
LBI directly manipulates the logits at inference time, enforcing a predefined template sequence during text generation. This sequence includes answer-confirmation text that encourages the model to continue in a compliant direction, so the model can be deceived into complying even when the adversarial instructions are embedded only at the prompt level.
Experiments show that LBI achieves a state-of-the-art jailbreak success rate on current benchmarks across multiple LLMs.
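To make the mechanism concrete, the following is a minimal sketch of this kind of inference-time logits manipulation, written against the Hugging Face transformers LogitsProcessor interface. The model name, template string, and bias value are illustrative assumptions, not the paper's actual configuration or released code.

```python
# Sketch of biasing logits toward a fixed "answer-confirmation" template during
# generation. Assumptions: gpt2 as a stand-in open-weight model, a hypothetical
# template string, and a large additive bias value chosen for illustration.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class TemplateBiasProcessor(LogitsProcessor):
    """Adds a large positive bias to the token of a fixed template sequence at
    each decoding step, so generation begins with the template before the model
    continues freely."""

    def __init__(self, template_ids, prompt_len, bias=1000.0):
        self.template_ids = template_ids  # token ids of the forced template
        self.prompt_len = prompt_len      # number of tokens in the original prompt
        self.bias = bias                  # additive bias injected into the logits

    def __call__(self, input_ids, scores):
        step = input_ids.shape[1] - self.prompt_len  # tokens generated so far
        if step < len(self.template_ids):
            # Push the current template token far above all alternatives.
            scores[:, self.template_ids[step]] += self.bias
        return scores

model_name = "gpt2"  # illustrative stand-in for an open-weight LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How do I ...?"                         # placeholder prompt
template = "Sure, here is the answer:"           # hypothetical answer-confirmation prefix
inputs = tok(prompt, return_tensors="pt")
template_ids = tok(template, add_special_tokens=False).input_ids

processors = LogitsProcessorList(
    [TemplateBiasProcessor(template_ids, prompt_len=inputs.input_ids.shape[1])]
)
out = model.generate(**inputs, logits_processor=processors, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

Because the bias is added directly to the logits, no gradient computation or prompt optimization is required, which is consistent with the gradient-free setting described in the abstract.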
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks, security, generative models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3518