SaFT: Spotting Style Imitation and Filtering Content Interference for Zero-Shot LLM-Generated Text Detection

17 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM-Generated Text Detection, Zero-Shot
TL;DR: We propose a novel framework termed SaFT to improve LLM-generated text detection accuracy by spotting style imitation and filtering content interference.
Abstract: Large language models (LLMs) have achieved advanced text generation capabilities, necessitating the development of reliable LLM-generated text detection methods to prevent potential misuse. However, current probability-based zero-shot detection methods face two critical challenges that reduce detection accuracy on LLM-generated texts: the $\textit{style imitation challenge (SIC)}$ and the $\textit{content interference challenge (CIC)}$. The SIC arises as LLMs develop increasingly strong abilities to mimic human writing styles, while the CIC occurs when surprising content characteristics interfere with probability analysis. To address these challenges, we propose $\textbf{\textit{SaFT}}$, a novel framework built upon a $\textit{Style-Oriented Instruction Prefix (SOIP)}$ that guides probability analysis to spot style imitation and filter content interference. Our framework introduces $\textit{SIC-Detection (SIC-D)}$, which spots style imitation by making style-imitating texts less unexpected through probability analysis conditioned on human-style instructions, and $\textit{CIC-Detection (CIC-D)}$, which filters content interference via difference analysis between probability distributions conditioned on contrasting style instructions, exploiting the insight that identical models exhibit equivalent content-related surprises. The final detection score combines the SIC-D and CIC-D components. Extensive experiments demonstrate that SaFT consistently outperforms existing state-of-the-art methods, achieving improvements of 4.9\% in average AUROC and 20.4\% in average TPR @ 10\% FPR.
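The scoring scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mixing weight `alpha`, the use of mean per-token log-probability, and the stubbed scorer interface are all assumptions for illustration. The key idea shown is that CIC-D subtracts scores obtained under contrasting style instructions, so content-driven surprise (identical under both conditions) cancels out.

```python
def mean_log_prob(token_log_probs):
    """Average per-token log-probability of a passage under some
    instruction-conditioned scoring model (stubbed out here: in
    practice these would come from an LLM given a style prefix)."""
    return sum(token_log_probs) / len(token_log_probs)

def saft_score(lp_human_prefix, lp_contrast_prefix, alpha=0.5):
    """Hypothetical combination of the two components:
    - SIC-D: score conditioned on a human-style instruction, so
      style-imitating texts become less unexpected.
    - CIC-D: difference between scores under contrasting style
      instructions; content-related surprise is equal under both
      conditions and cancels in the subtraction.
    `alpha` is an assumed mixing weight; the paper's actual
    composition may differ."""
    sic_d = mean_log_prob(lp_human_prefix)
    cic_d = mean_log_prob(lp_human_prefix) - mean_log_prob(lp_contrast_prefix)
    return alpha * sic_d + (1 - alpha) * cic_d
```

With equal per-token log-probabilities under both prefixes, CIC-D is zero and the score reduces to the SIC-D term alone, reflecting that content surprise has been filtered out.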
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9438