"I know myself better, but not really greatly'': How Well Can LLMs Detect and Explain LLM-Generated Texts?
Abstract: Distinguishing between human- and LLM-generated texts is crucial given the risks associated with misuse of LLMs. This paper investigates the detection and explanation capabilities of current LLMs in two settings: binary (human vs. LLM-generated) and ternary classification (adding an ``undecided'' class). We evaluate six closed- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs), though both remain suboptimal. Introducing the ternary classification framework improves both detection accuracy and explanation quality across all models. Through comprehensive quantitative and qualitative analyses of our human-annotated dataset, we identify key explanation failures, primarily reliance on inaccurate features, hallucinations, and flawed reasoning. Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: free-text/natural language explanations
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Keywords: LLM-generated text detection, explainability of LLMs
Submission Number: 2844