EvoBench: Towards Real-world LLM-Generated Text Detection Benchmarking for Evolving Large Language Models
Abstract: With the widespread adoption of Large Language Models (LLMs), there has been an increasing need to detect LLM-generated text, prompting extensive research in this area. However, existing detection methods are mainly evaluated on static benchmarks, which neglect the evolving nature of LLMs. Relying on these static benchmarks could create a misleading sense of security, overestimating the real-world effectiveness of detection methods. To bridge this gap, we introduce EvoBench, a dynamic benchmark that considers a new dimension of generalization: generalization across continuously evolving LLMs. EvoBench categorizes evolving LLMs into (1) updates over time and (2) developments such as fine-tuning and pruning, covering $7$ LLM families and their $30$ evolving versions. To measure generalization across evolving LLMs, we introduce a new EMG (Evolving Model Generalization) metric. Our evaluation of $14$ detection methods on EvoBench reveals that all of them struggle to maintain generalization when confronted with evolving LLMs. To mitigate these generalization problems, we further propose improvement strategies that yield EMG gains of up to $12.15\%$. Our research sheds light on critical challenges in real-world LLM-generated text detection and represents a significant step toward practical applications.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: LLM-generated text detection, Evolving large language models
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3746