Abstract: The widespread adoption of large language models (LLMs) has raised growing concerns about the misuse of AI-generated content in academic writing, misinformation, and digital manipulation. Detecting such LLM-generated text, especially in zero-shot settings, remains a significant challenge due to the limited generalizability of existing approaches. In this paper, we propose a zero-shot detection framework based on dual-network preference divergence. Specifically, we compare token-level log-likelihoods between two LLMs with divergent preference alignments: a human-preference model trained on natural language corpora and a machine-preference model fine-tuned through instruction tuning and reinforcement learning. This divergence captures stylistic and distributional signals characteristic of LLM-generated text. We evaluate the proposed method on six datasets covering both Chinese and English texts generated by LLMs including GLM4-9B, Qwen2.5-14B, InternLM2.5-20B, and GPT-4. The proposed method demonstrates strong generalization across diverse domains, parameters, and prompts, achieving AUC improvements of 2% to 67% over state-of-the-art baselines on the majority of datasets. It also maintains a low false-positive rate, making it suitable for sensitive applications such as academic integrity detection. Furthermore, the method provides fine-grained interpretability by highlighting suspicious tokens, enabling transparent and explainable detection outcomes.
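To make the core idea in the abstract concrete, the sketch below scores a candidate text under two causal LMs and uses the gap in average per-token log-likelihood as a detection signal. This is a minimal illustration of the dual-network preference-divergence principle, not the paper's implementation: the model names are assumed stand-ins (any base model vs. instruction-tuned model pair), and the averaging and sign convention are illustrative choices.

```python
# Minimal sketch of preference-divergence scoring (illustrative assumptions):
# a "human-preference" base LM vs. a "machine-preference" instruction-tuned LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

HUMAN_MODEL = "gpt2"                           # assumed stand-in for the human-preference model
MACHINE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"   # assumed stand-in for the machine-preference model


def token_logprobs(model, tokenizer, text, device="cpu"):
    """Per-token log-likelihoods of `text` under a causal LM."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**enc).logits                       # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predict token t+1 from prefix
    targets = enc["input_ids"][:, 1:]
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]


def preference_divergence(text, device="cpu"):
    """Average log-likelihood gap (machine-preference minus human-preference).

    The two models may use different tokenizers, so each is averaged over its
    own tokenization before taking the difference.
    """
    h_tok = AutoTokenizer.from_pretrained(HUMAN_MODEL)
    m_tok = AutoTokenizer.from_pretrained(MACHINE_MODEL)
    h_lm = AutoModelForCausalLM.from_pretrained(HUMAN_MODEL).to(device).eval()
    m_lm = AutoModelForCausalLM.from_pretrained(MACHINE_MODEL).to(device).eval()

    human_ll = token_logprobs(h_lm, h_tok, text, device).mean()
    machine_ll = token_logprobs(m_lm, m_tok, text, device).mean()
    # Larger values: the text is better explained by the machine-preference model,
    # suggesting it is more likely LLM-generated.
    return (machine_ll - human_ll).item()


if __name__ == "__main__":
    score = preference_divergence("The quick brown fox jumps over the lazy dog.")
    print(f"preference divergence score: {score:.4f}")
```

The same per-token gaps, computed token by token rather than averaged, are what would support the fine-grained interpretability mentioned in the abstract (highlighting individual suspicious tokens); the threshold for flagging a text would be tuned on held-out data.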