"I know myself better, but not really greatly'': How Well Can LLMs Detect and Explain LLM-Generated Texts?
Abstract: Distinguishing between human- and LLM-generated texts is crucial given the risks associated with misuse of LLMs. This paper investigates the detection and explanation capabilities of current LLMs in two settings: binary (human vs. LLM-generated) and ternary classification (adding an ``undecided'' class). We evaluate six closed- and open-source LLMs of varying sizes and find that self-detection (LLMs identifying their own outputs) consistently outperforms cross-detection (identifying outputs from other LLMs), though both remain suboptimal. Introducing the ternary classification framework improves both detection accuracy and explanation quality across all models. Through comprehensive quantitative and qualitative analyses of our human-annotated dataset, we identify key explanation failures, primarily reliance on inaccurate features, hallucinations, and flawed reasoning. Our findings underscore the limitations of current LLMs in self-detection and self-explanation, highlighting the need for further research to address overfitting and enhance generalizability.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: free-text/natural language explanations
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Keywords: LLM-generated text detection, explainability of LLMs
Submission Number: 2844