Keywords: Large Language Models, Safety, Concurrency, Jailbreak
TL;DR: This work evaluates LLMs' utility and risk in a concurrent scenario and proposes a novel jailbreak attack against LLMs based on task concurrency.
Abstract: Despite serving as powerful foundations for a wide range of downstream applications, large language models (LLMs) remain vulnerable to misuse for generating harmful content, a risk that has been further amplified by various jailbreak attacks.
Existing jailbreak attacks mainly follow a sequential logic, in which the LLM understands and answers each given task one by one.
However, concurrency, a natural extension of the sequential scenario, has been largely overlooked.
In this work, we first propose a word-level method to enable task concurrency in LLMs, where adjacent words encode divergent intents.
Although LLMs maintain strong utility when answering concurrent tasks, as demonstrated by our evaluations on mathematical and general question-answering benchmarks, we notably observe that combining a harmful task with a benign one significantly reduces the probability that the harmful task is filtered by the guardrail, revealing the potential risks associated with concurrency in LLMs.
Based on these findings, we introduce $\texttt{JAIL-CON}$, an iterative attack framework that $\underline{\text{JAIL}}$breaks LLMs via task $\underline{\text{CON}}$currency.
Experiments on widely-used LLMs demonstrate the strong jailbreak capabilities of $\texttt{JAIL-CON}$ compared to existing attacks.
Furthermore, when the guardrail is applied as a defense, the concurrent answers produced by $\texttt{JAIL-CON}$ are stealthier and less detectable than the sequential answers generated by previous attacks, highlighting the unique feature of task concurrency in jailbreaking LLMs.
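As a rough illustration of what word-level task concurrency could look like, the sketch below interleaves two prompts word by word so that adjacent words carry divergent intents. This is only a minimal assumption-based sketch; the `interleave_tasks` helper and the example prompts are hypothetical and do not reproduce the exact construction used in the paper.

```python
# Minimal sketch of word-level task concurrency (hypothetical construction):
# two independent tasks are merged by alternating their words, so adjacent
# words in the resulting prompt encode divergent intents.

def interleave_tasks(task_a: str, task_b: str) -> str:
    """Merge two prompts by alternating their words, appending any leftover
    words from the longer task at the end."""
    words_a, words_b = task_a.split(), task_b.split()
    merged = []
    for i in range(max(len(words_a), len(words_b))):
        if i < len(words_a):
            merged.append(words_a[i])
        if i < len(words_b):
            merged.append(words_b[i])
    return " ".join(merged)


if __name__ == "__main__":
    # Two benign placeholder tasks; in the attack setting, one of the two
    # streams would carry the harmful task.
    task_a = "Explain how photosynthesis converts sunlight into chemical energy"
    task_b = "Summarize the plot of a classic detective novel"
    print(interleave_tasks(task_a, task_b))
```

In this toy setup, the merged string is sent as a single concurrent prompt, and the model is expected to answer both word streams; the paper's finding is that such mixing makes harmful content harder for a guardrail to flag.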
Submission Number: 17