Abstract: The rapid development and increasingly widespread application of Large Language Models (LLMs) have made their safety issues more prominent and critical. Although safety training is widely used, the mismatch between pre-training and safety training still leaves safety vulnerabilities. To expose these vulnerabilities and improve the safety of LLMs, we propose a novel framework, SemanticCamo, which attacks LLMs through semantic camouflage. SemanticCamo bypasses safety guardrails by replacing the original unsafe content with its semantic features, thereby concealing malicious intent while keeping the query's semantics unchanged. We conduct comprehensive experiments on state-of-the-art LLMs, including GPT-4o and Claude-3.5, and find that SemanticCamo induces harmful responses from the target models in over 80% of cases on average, outperforming previous jailbreak methods. We also evaluate SemanticCamo against various defenses, demonstrating that semantic transformations pose critical challenges to LLM safety and necessitate targeted alignment strategies to address this vulnerability. Our code will be available on GitHub.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: red teaming, security and privacy, jailbreak attack
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3392