Abstract: As large language models (LLMs) are increasingly deployed in sensitive and specialized applications, ensuring their robustness is vital. A major challenge is their vulnerability to adversarial attacks, which exploit inherent weaknesses and can cause the models to generate unintended or harmful responses. Although considerable effort has been devoted during the post-training phase of LLMs to detecting these vulnerabilities, jailbreak attacks continue to evolve, shifting from prompt-level exploits to model-level and even gradient-level optimization, and now affect both white-box models and complex black-box systems.
This paper proposes a novel gradient-free adversarial attack, named Logits Bias Injection (LBI), designed for open-weight LLMs.
LBI directly manipulates the logits at inference time, enforcing a predefined template sequence during text generation. This sequence includes answer-confirmation text that encourages the model to continue in a compliant direction, so the model can be deceived into complying even when the adversarial instructions are embedded only at the prompt level.
Experiments show that LBI achieves a state-of-the-art jailbreak success rate on current benchmarks across multiple LLMs.
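To make the mechanism concrete, the following is a minimal sketch of this kind of inference-time logits manipulation, written against the Hugging Face transformers LogitsProcessor interface. The model name, template string, and bias value are illustrative assumptions, not the paper's actual configuration or released code.

```python
# Sketch of biasing logits toward a fixed "answer-confirmation" template during
# generation. Assumptions: gpt2 as a stand-in open-weight model, a hypothetical
# template string, and a large additive bias value chosen for illustration.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessor,
    LogitsProcessorList,
)

class TemplateBiasProcessor(LogitsProcessor):
    """Adds a large positive bias to the token of a fixed template sequence at
    each decoding step, so generation begins with the template before the model
    continues freely."""

    def __init__(self, template_ids, prompt_len, bias=1000.0):
        self.template_ids = template_ids  # token ids of the forced template
        self.prompt_len = prompt_len      # number of tokens in the original prompt
        self.bias = bias                  # additive bias injected into the logits

    def __call__(self, input_ids, scores):
        step = input_ids.shape[1] - self.prompt_len  # tokens generated so far
        if step < len(self.template_ids):
            # Push the current template token far above all alternatives.
            scores[:, self.template_ids[step]] += self.bias
        return scores

model_name = "gpt2"  # illustrative stand-in for an open-weight LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How do I ...?"                         # placeholder prompt
template = "Sure, here is the answer:"           # hypothetical answer-confirmation prefix
inputs = tok(prompt, return_tensors="pt")
template_ids = tok(template, add_special_tokens=False).input_ids

processors = LogitsProcessorList(
    [TemplateBiasProcessor(template_ids, prompt_len=inputs.input_ids.shape[1])]
)
out = model.generate(**inputs, logits_processor=processors, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```

Because the bias is added directly to the logits, no gradient computation or prompt optimization is required, which is consistent with the gradient-free setting described in the abstract.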
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks, security, generative models
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 3518