EvoBench: Towards Real-world LLM-Generated Text Detection Benchmarking for Evolving Large Language Models
Abstract: With the widespread adoption of Large Language Models (LLMs), there has been an increasing need to detect LLM-generated text, prompting extensive research in this area. However, existing detection methods are mainly evaluated on static benchmarks, which neglect the evolving nature of LLMs. Relying on these static benchmarks could create a misleading sense of security, overestimating the real-world effectiveness of detection methods. To bridge this gap, we introduce EvoBench, a dynamic benchmark that considers a new dimension of generalization: generalization across continuously evolving LLMs. EvoBench categorizes evolving LLMs into (1) updates over time and (2) developments such as fine-tuning and pruning, covering $7$ LLM families and their $30$ evolving versions. To measure generalization across evolving LLMs, we introduce a new EMG (Evolving Model Generalization) metric. Our evaluation of $14$ detection methods on EvoBench reveals that all of them struggle to maintain generalization when confronted with evolving LLMs. To mitigate these generalization problems, we further propose improvement strategies that yield EMG gains of up to $12.15\%$. Our research sheds light on critical challenges in real-world LLM-generated text detection and represents a significant step toward practical applications.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: LLM-generated text detection, Evolving large language models
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3746