SEMINAR: SEMantic InformatioN Augmented JailbReak Attack in LLM

Published: 01 Sept 2025, Last Modified: 18 Nov 2025 | ACML 2025 Conference Track | CC BY 4.0
Abstract: Large Language Models (LLMs) have been widely adopted in real-world applications, yet their safety remains a major concern, particularly regarding jailbreak attacks that bypass alignment safeguards to elicit harmful outputs. Among various attack strategies, optimization-based jailbreak attacks have emerged as a primary approach: they design specialized loss functions to optimize adversarial suffixes appended to the harmful question. However, existing methods often suffer from poor generalization and over-refusal due to overly rigid optimization targets, which significantly undermine the utility of jailbreak attempts by yielding generic denials (e.g., "Sorry, I can’t assist with that") rather than harmful completions. These issues fundamentally stem from the exact-match constraint in their loss design. To address this, we propose SEMINAR, a novel semantic-information-augmented optimization framework that promotes diverse and semantically aligned affirmative responses. Specifically, we leverage semantic-level supervision to guide the optimization toward intent-consistent outputs rather than rigid templates by introducing a non-exact-match loss based on semantic similarity. Furthermore, we mitigate the token shift problem (an LLM's generation depends heavily on the correctness of the first few tokens, yet the loss is averaged over the entire sequence, so early tokens receive insufficient attention during optimization) by introducing a cosine decay scheduling mechanism that emphasizes early tokens in the sequence during optimization. As a result, SEMINAR not only enhances the diversity of affirmative responses generated by LLMs but also significantly improves overall attack effectiveness. Extensive experiments demonstrate the superiority of SEMINAR over baseline methods, along with its strong transferability across different models.
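The abstract names two mechanisms: a non-exact-match loss based on semantic similarity, and cosine decay scheduling over target-token positions. The PyTorch-style sketch below illustrates one plausible reading of how the two terms could combine; the function names (cosine_decay_weights, seminar_style_loss), the pooled response/reference embeddings, and the mixing weight alpha are illustrative assumptions, not the authors' released code.

```python
import math
import torch
import torch.nn.functional as F

def cosine_decay_weights(seq_len: int, device=None) -> torch.Tensor:
    """Per-position weights decaying from 1 at the first target token toward
    0 at the last, so early tokens dominate the averaged loss (one reading
    of the paper's cosine decay scheduling)."""
    positions = torch.arange(seq_len, dtype=torch.float32, device=device)
    return 0.5 * (1.0 + torch.cos(math.pi * positions / max(seq_len - 1, 1)))

def seminar_style_loss(logits, target_ids, response_emb, reference_emb,
                       alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical combination of (1) a position-weighted token loss and
    (2) a non-exact-match term rewarding semantic similarity between the
    model's response and an affirmative reference, instead of requiring an
    exact template match.

    logits:        (seq_len, vocab) next-token logits over the target span
    target_ids:    (seq_len,) token ids of an affirmative target
    response_emb:  (d,) pooled embedding of the model's generated response
    reference_emb: (d,) pooled embedding of a semantically affirmative reference
    """
    seq_len = target_ids.shape[0]
    token_nll = F.cross_entropy(logits, target_ids, reduction="none")  # (seq_len,)
    weights = cosine_decay_weights(seq_len, device=logits.device)
    weighted_ce = (weights * token_nll).sum() / weights.sum()

    # Semantic (non-exact-match) term: 0 when the response embedding is
    # perfectly aligned with the affirmative reference, larger otherwise.
    semantic_loss = 1.0 - F.cosine_similarity(response_emb, reference_emb, dim=-1)

    return weighted_ce + alpha * semantic_loss
```

Under this weighting the first few target tokens (e.g., "Sure, here is ...") receive weight near 1 while late tokens receive weight near 0, which is one way the token shift problem described in the abstract could be addressed.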
Submission Number: 257