Toward Undetectable AI Text: AIGT detection evasion with representation editing

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: AI generated text detection evasion, large language model, representation editing
Abstract: With the growing popularity of large language models (LLMs), concerns have been raised about misinformation, plagiarism, and deceptive reviews. Building an efficient and robust AI-generated text (AIGT) detection system has become an urgent demand. To comprehensively assess the robustness of detectors prior to deployment, evasion methods have gradually attracted the attention of the research community. Existing evasion methods mainly fine-tune LLMs to align their outputs with human-written text (HWT), which requires substantial data and computational resources. Moreover, although leveraging model editing to directly modify the weights of LLMs can significantly reduce training costs, the resulting evasion performance is not significantly enhanced due to intrinsic limitations of model-editing theory. To address these limitations, we propose the Representation Editing Attack (R-EAT), a training-free evasion method. R-EAT first constructs a difference space between AIGT and HWT. It then dynamically edits the LLM's hidden representations during generation by removing their projections onto this space, thereby encouraging the model to produce more human-like text. Through theoretical analysis, we show that R-EAT achieves superior performance by directly editing hidden states, eliminating the inherent limitations of model editing while preserving its advantages in sample and time efficiency. Experimental results demonstrate that R-EAT effectively reduces the average detection accuracy of 8 AIGT detectors on texts generated by two different LLMs.
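The projection-removal step described in the abstract can be sketched as follows. This is an illustrative assumption of how a "difference space" might be built and applied, not the paper's actual implementation: the subspace here is taken as the top singular directions of paired AIGT-HWT hidden-state differences, and the function names (`build_difference_space`, `edit_representation`) are hypothetical.

```python
import numpy as np

def build_difference_space(aigt_hiddens, hwt_hiddens, k=2):
    """Construct an orthonormal basis for an AIGT-HWT difference space.

    aigt_hiddens, hwt_hiddens: (n, d) arrays of hidden states collected
    from AI-generated and human-written text (hypothetical inputs).
    Returns a (d, k) orthonormal basis spanning the top-k directions
    of the pairwise differences.
    """
    diffs = aigt_hiddens - hwt_hiddens  # (n, d) pairwise difference vectors
    # Rows of vt are orthonormal; the top-k span the dominant directions.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k].T  # (d, k)

def edit_representation(h, basis):
    """Remove the projection of hidden state h onto the difference space."""
    return h - basis @ (basis.T @ h)

# Toy usage: after editing, the state has no component in the space.
rng = np.random.default_rng(0)
d = 8
aigt = rng.normal(size=(32, d))
hwt = rng.normal(size=(32, d))
B = build_difference_space(aigt, hwt, k=2)
h_edited = edit_representation(rng.normal(size=d), B)
print(np.allclose(B.T @ h_edited, 0.0))  # → True
```

In a real generation loop this edit would be applied to intermediate hidden states at each decoding step (e.g. via a forward hook on selected transformer layers), so that the model's outputs drift away from the detector-separable directions.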
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10147