Keywords: LLM evaluation, safety alignment, adversarial prompting, emoji prompts, robustness, evaluation methodology
TL;DR: Emoji-based prompts reveal that text-only safety evaluations may overlook representation-dependent vulnerabilities in LLMs.
Abstract: Safety evaluations of large language models (LLMs) predominantly rely on text-based adversarial prompts, potentially overlooking vulnerabilities arising from alternative input representations. We examine emoji-augmented prompts as a test case for this gap, evaluating 50 prompts across four open-source LLMs (Mistral 7B, Qwen 2 7B, Gemma 2 9B, Llama 3 8B). Results show substantial variation in robustness: Gemma 2 9B and Mistral 7B exhibit attack success rates of 10%, Llama 3 8B shows 6%, while Qwen 2 7B shows complete resistance (0%). A chi-square test (χ² = 32.94, p < 0.001) confirms significant differences in outcome distributions across models. These findings indicate that robustness is sensitive to input representation, and that evaluations restricted to standard text prompts may underrepresent model vulnerabilities.
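Since the abstract reports a chi-square test over per-model outcome distributions, a minimal sketch of that computation is shown below. The two-column success/non-success contingency table is an illustrative assumption built from the stated rates (10%, 10%, 6%, 0% of 50 prompts each), not the paper's raw data; the reported χ² = 32.94 suggests the authors' actual test uses a finer-grained outcome taxonomy than this binary split.

```python
# Hedged sketch: chi-square test of independence over per-model outcomes,
# using scipy.stats.chi2_contingency. Counts are illustrative placeholders
# derived from the abstract's stated success rates, not the paper's data.
from scipy.stats import chi2_contingency

# Rows: models; columns: [attack successes, non-successes] out of 50 prompts.
observed = [
    [5, 45],  # Gemma 2 9B: 10% success
    [5, 45],  # Mistral 7B: 10% success
    [3, 47],  # Llama 3 8B:  6% success
    [0, 50],  # Qwen 2 7B:   0% success
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```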
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Non-archival
Submission Number: 96