Keywords: llm-generated text detection, editing tasks, wikipedia, benchmark, multilingual
TL;DR: We propose an LLM-generated text detection benchmark for realistic editing tasks and show that state-of-the-art detectors considerably underperform.
Abstract: Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia.
Existing detection benchmarks primarily focus on \textit{generic} text generation tasks (e.g., ``Write an article about machine learning.'').
However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation).
These \textit{task-specific} MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning.
In this work, we show that a range of MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia.
We introduce \textsc{TSM-Bench}, a multilingual, multi-generator, and multi-task benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks.
Our findings demonstrate that (\textit{i}) average detection accuracy drops by 10--40\% compared to prior benchmarks, and (\textit{ii}) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data---even across domains---but not vice versa.
We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation.
Our results suggest that, contrary to the conclusions drawn from prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms.
\textsc{TSM-Bench} therefore provides a crucial foundation for developing and evaluating future models.
Primary Area: datasets and benchmarks
Submission Number: 22069