BRED: A Comprehensive Benchmark for the Robust Evaluation of LLM-Generated Text Detection in Realistic Scenarios
Keywords: AI-generated text detection; LLM
Abstract: The rapid advancement of large language models (LLMs) has created a pressing need for robust detectors capable of distinguishing machine-generated from human-written text. However, existing benchmarks often lack the comprehensive scope needed for rigorous testing. We introduce BRED, a new benchmark that offers four key contributions: 1) extensive coverage of diverse domains and compositional operations, 2) in-depth analysis of the LLM generation pipeline and of compositional operations, 3) evaluation across different LLM variants and model groups, and 4) in-depth exploration of supervised detectors. Through extensive evaluation of baseline detectors, we report three key findings: 1) supervised embedding-based detectors are the most robust against diverse generation strategies, 2) text generated by larger models does not exhibit significantly greater resistance to detection, and 3) current detection methods struggle significantly with texts that have undergone secondary operations, whether applied by LLMs or by compositional operations. BRED provides a standardized platform for assessing detector robustness and offers practical insights for advancing AI-generated text detection.
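As a rough illustration of the detector class named in finding 1, a supervised embedding-based detector encodes each text with a pretrained sentence encoder and fits a classifier on the embeddings. The sketch below is an assumption for illustration only: the encoder choice, classifier, and toy data are not BRED's actual baselines.

```python
# Minimal sketch of a supervised embedding-based detector (illustrative,
# not the benchmark's baseline): embed texts with a pretrained sentence
# encoder, then train a linear classifier on the embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy training set: label 1 = LLM-generated, 0 = human-written.
train_texts = [
    "The results demonstrate a statistically significant improvement.",
    "honestly i just threw the draft together the night before, sorry",
]
train_labels = [1, 0]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
X_train = encoder.encode(train_texts)              # shape: (n_texts, embed_dim)

clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# Score a new text: estimated probability that it is machine-generated.
X_new = encoder.encode(["This paper introduces a novel framework."])
print(clf.predict_proba(X_new)[:, 1])
```

In practice such a detector would be trained on large paired corpora across domains; the abstract's finding is that this embedding-plus-classifier family holds up best under diverse generation strategies.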
Primary Area: datasets and benchmarks
Submission Number: 18698