Abstract: Large Language Models (LLMs) have attained human-level fluency in text generation, which makes it difficult to distinguish between human-written and LLM-generated texts. This increases the risk of misuse and highlights the need for reliable detectors. Yet, existing detectors exhibit poor robustness on out-of-distribution (OOD) data and attacked data, both of which are critical for real-world scenarios. They also struggle to provide explainable evidence to support their decisions, which undermines their reliability. In light of these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter that identifies predicted prompts that could have generated the input text, and a Distinguisher that examines how well the input texts align with the predicted prompts. We develop and examine two versions of the Distinguisher: the Prompt-Text Consistency Verifier, which verifies whether the predicted prompt could plausibly generate the input text, and the Regeneration Comparator, which evaluates the similarity between the regenerated texts and the input texts. Empirical evaluations demonstrate that both Distinguishers significantly surpass baseline methods, with the Regeneration Comparator outperforming baselines by 9.73% on in-distribution data (F1-score) and 12.65% on OOD data (AUROC). Furthermore, a user study shows that IPAD enhances the trustworthiness of AI detection by allowing users to directly examine the decision-making evidence, which provides interpretable support for its state-of-the-art detection results.
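The abstract describes a two-stage pipeline: invert the input text into a predicted prompt, then judge how well the text aligns with that prompt. The following is a minimal illustrative sketch of that flow, not the authors' implementation; the function names, the injected LLM callables, and the toy similarity metric are assumptions for illustration.

```python
# Illustrative sketch of the IPAD-style pipeline (Prompt Inverter + Regeneration
# Comparator) described in the abstract. Names and the similarity heuristic are
# placeholders, not the paper's actual components.
from difflib import SequenceMatcher
from typing import Callable, Dict


def ipad_detect(
    text: str,
    invert_prompt: Callable[[str], str],  # Prompt Inverter: text -> predicted prompt
    generate: Callable[[str], str],       # LLM used to regenerate from the prompt
    threshold: float = 0.5,               # assumed decision threshold
) -> Dict[str, object]:
    """Return a detection decision plus the evidence a user can inspect."""
    # Step 1: recover a prompt that could plausibly have produced the text.
    predicted_prompt = invert_prompt(text)

    # Step 2 (Regeneration Comparator variant): regenerate from the predicted
    # prompt and compare with the input; high similarity suggests LLM origin.
    regenerated = generate(predicted_prompt)
    similarity = SequenceMatcher(None, text, regenerated).ratio()  # toy metric

    return {
        "predicted_prompt": predicted_prompt,   # interpretable evidence
        "regenerated_text": regenerated,        # interpretable evidence
        "similarity": similarity,
        "is_ai_generated": similarity >= threshold,
    }
```

In this sketch, the predicted prompt and the regenerated text are returned alongside the decision, mirroring the abstract's claim that users can directly examine the decision-making evidence.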
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: AI Detection, Prompt Inversion, Large Language Models, Explainability, AI Safety
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 5091