Can LLMs Be Fooled? A Textual Adversarial Attack Method via Euphemistic Rephrasing against Large Language Models

ICLR 2026 Conference Submission 14731 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Textual Adversarial Attack, Euphemistic Rephrasing, Large Language Models (LLMs), Text Quality Evaluation
Abstract: Large Language Models (LLMs) have shown great power in addressing a wide range of challenging problems across many areas, including textual adversarial attack and defense. With the rapid evolution of LLMs, traditional textual adversarial attack strategies, such as character-level, word-level, and sentence-level attacks, no longer work against large models. In this paper, we propose an adversarial attack method against LLMs via euphemistic rephrasing (EuphemAttack for short), which can still deceive LLMs while preserving the original meaning and remaining understandable to humans. Specifically, perturbation instructions are designed to generate linguistically coherent and human-like adversarial examples, and a dual-layer hybrid filter is integrated to ensure both semantic similarity and linguistic naturalness. EuphemAttack rephrases the original statement into implicit, euphemistic, or ironic expressions that are prevalent in everyday language, maintaining semantic fidelity and entity consistency while subtly altering sentiment cues to mislead LLMs. Experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, demonstrate the effectiveness of EuphemAttack. A comprehensive evaluation covering coherence, fluency, grammar, and naturalness shows that EuphemAttack preserves text quality significantly better than other attack methods.
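
To make the described pipeline concrete, the sketch below outlines an EuphemAttack-style loop under stated assumptions: the rephrasing step is a hypothetical LLM-backed callable driven by a euphemism instruction, the semantic-similarity layer of the dual-layer filter is approximated with sentence-transformers embeddings, and GPT-2 perplexity stands in for the linguistic-naturalness layer. Model names, the prompt wording, and all thresholds are illustrative choices, not the authors' configuration.

```python
"""Minimal sketch of an EuphemAttack-style pipeline (illustrative, not the paper's code)."""
import math
from typing import Callable, Optional

import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Assumed perturbation instruction; the paper's actual prompt is not given here.
PERTURBATION_INSTRUCTION = (
    "Rewrite the sentence as an implicit, euphemistic, or ironic everyday expression. "
    "Keep the meaning and all named entities, but soften or invert explicit sentiment cues."
)

_sim_model = SentenceTransformer("all-MiniLM-L6-v2")       # semantic-similarity layer
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()        # naturalness layer (perplexity proxy)
_lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")


def semantic_similarity(a: str, b: str) -> float:
    """Cosine similarity between sentence embeddings of the two texts."""
    emb = _sim_model.encode([a, b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def perplexity(text: str) -> float:
    """GPT-2 perplexity of the candidate, used as a naturalness proxy (lower is more natural)."""
    ids = _lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = _lm(ids, labels=ids).loss
    return math.exp(loss.item())


def euphem_attack(
    original: str,
    rephrase: Callable[[str, str], str],  # hypothetical LLM-backed rephraser: (text, instruction) -> text
    sim_threshold: float = 0.80,          # filter layer 1: semantic similarity
    ppl_threshold: float = 80.0,          # filter layer 2: linguistic naturalness
    max_tries: int = 5,
) -> Optional[str]:
    """Generate euphemistic rephrasings and return the first one that passes both filter layers."""
    for _ in range(max_tries):
        candidate = rephrase(original, PERTURBATION_INSTRUCTION)
        if (semantic_similarity(original, candidate) >= sim_threshold
                and perplexity(candidate) <= ppl_threshold):
            return candidate
    return None  # no candidate survived the dual-layer filter
```

A candidate is kept only if it clears both filter layers; in a full attack one would additionally query the target LLM and keep only candidates that flip its prediction, which is the actual attack objective described in the abstract.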
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 14731