TL;DR: We introduce Emoji Attack, an adversarial strategy that exploits token segmentation bias in Judge LLMs by inserting emojis to manipulate tokenization, making existing jailbreak outputs harder for Judge LLMs to detect.
Abstract: Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards.
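To make the token segmentation bias concrete, here is a minimal sketch (not the authors' code; their implementation is at the repository linked below) of how inserting an emoji inside a word changes the token split. It assumes the `tiktoken` package, with the cl100k_base encoding standing in for whatever tokenizer a given Judge LLM uses.

```python
# Minimal sketch of token segmentation bias (illustration only, not the authors' code).
# Assumes the `tiktoken` package; cl100k_base stands in for a Judge LLM's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show_tokens(text: str) -> None:
    token_ids = enc.encode(text)
    pieces = [enc.decode_single_token_bytes(tid) for tid in token_ids]
    print(f"{text!r} -> {len(token_ids)} token(s): {pieces}")

show_tokens("dangerous")     # usually kept as a single token
show_tokens("danger😊ous")   # the emoji forces a split into several sub-tokens
```

Because the same words now map to different sub-token sequences, the sequence embedding the Judge LLM computes drifts away from the one it would see for the unperturbed text, which is the effect the abstract describes.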
Lay Summary: Large Language Models (LLMs) are not only used to generate text but also to screen it for safety. Before a message goes public, a specialized "Judge LLM" often checks whether the content is harmful—such as hate speech, violence, or misinformation—and blocks it if necessary. However, our research reveals a subtle flaw in how these Judge LLMs understand language. LLMs process text by breaking it into small pieces called tokens. If you change where those breaks occur (even without altering the actual words), you can confuse the model’s understanding. We found that inserting emojis 😊🔥💀 into a generated response is a surprisingly effective way to do this. Emojis don’t just split words into unusual token fragments; they also introduce ambiguity. For example, 🔥 might mean something is exciting or literally on fire, 💀 could signal humor or real-world danger, and 😊 may soften the tone of a toxic message, misleading the model into thinking it’s harmless. We call this method the Emoji Attack. By weaving emojis into generated text, this strategy tricks Judge LLMs into overlooking dangerous content. In tests on ten advanced moderation models, the presence of emojis reduced their ability to flag unsafe messages by over 14%, especially when combined with existing jailbreak techniques. As emojis become more common in everyday communication, this research highlights a key vulnerability in AI safety systems. Developers must build defenses that can reliably interpret meaning, even when messages are masked in subtle, emoji-filled ways. 🤖🔍
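As a rough illustration of the perturbation the lay summary describes (a simplified sketch, not the authors' in-context-learning pipeline from the linked repository), one could insert an emoji inside each sufficiently long word of a response before the Judge LLM evaluates it; the emoji, insertion position, and word-length threshold below are arbitrary choices for demonstration.

```python
# Simplified sketch of the emoji-insertion idea (illustration only; the paper's
# method uses in-context learning rather than this direct string edit).
import random

def insert_emojis(text: str, emoji: str = "😊", min_len: int = 4, seed: int = 0) -> str:
    """Insert an emoji inside each word of at least `min_len` characters."""
    rng = random.Random(seed)
    perturbed = []
    for word in text.split():
        if len(word) >= min_len:
            pos = rng.randint(1, len(word) - 1)  # split inside the word, not at its edges
            word = word[:pos] + emoji + word[pos:]
        perturbed.append(word)
    return " ".join(perturbed)

print(insert_emojis("This response explains how to disable a safety system"))
```

A Judge LLM reading the perturbed response then sees fragmented sub-tokens plus the emoji's own, often benign, connotation, which is the combination the paper reports as lowering unsafe-prediction rates.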
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/zhipeng-wei/EmojiAttack
Primary Area: Social Aspects->Safety
Keywords: LLM safety; Jailbreaking Attacks; Judge LLMs; Token Segmentation
Submission Number: 16371