Abstract: The rapid development and increasingly widespread application of Large Language Models (LLMs) have made their safety issues more prominent and critical. Although safety training is widely used, the mismatch between pre-training and safety training still leaves safety vulnerabilities. To expose these vulnerabilities and improve the safety of LLMs, we propose a novel framework, SemanticCamo, which attacks LLMs through semantic camouflage. SemanticCamo bypasses safety guardrails by replacing the original unsafe content with its semantic features, thereby concealing malicious intent while keeping the query's semantics unchanged. We conduct comprehensive experiments on state-of-the-art LLMs, including GPT-4o and Claude-3.5, and find that SemanticCamo induces harmful responses from the target models in over 80% of cases on average, outperforming previous jailbreak methods. We also evaluate SemanticCamo against various defenses, demonstrating that semantic transformations pose critical challenges to LLM safety and necessitate targeted alignment strategies to address this vulnerability. Our code will be available on GitHub.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: red teaming, security and privacy, jailbreak attack
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 3392