Keywords: AI-Generated Text Detection, Inverse Prompting, Explainability, Large Language Models, Authorship Attribution, Trustworthy NLP
TL;DR: We propose IPAD, an interpretable method to detect LLM-generated text by reconstructing human intent via inverse prompting.
Abstract: Large Language Models (LLMs) have attained human-level fluency in text generation, making it increasingly difficult to distinguish human-written from LLM-generated texts. This increases the risk of misuse and highlights the need for reliable detectors. Yet existing detectors exhibit poor robustness on out-of-distribution (OOD) and attacked data, which is critical for real-world scenarios. They also struggle to provide interpretable evidence to support their decisions, undermining their reliability. In light of these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter that identifies predicted prompts that could have generated the input text, and two Distinguishers that examine the probability that the input text aligns with the predicted prompts. Empirical evaluations demonstrate that IPAD outperforms the strongest baselines by 9.05% (Average Recall) on in-distribution data, 12.93% (AUROC) on OOD data, and 5.48% (AUROC) on attacked data. IPAD also remains robust on structured datasets. Furthermore, an interpretability assessment illustrates that IPAD enhances the trustworthiness of AI detection by allowing users to directly examine the decision-making evidence, providing interpretable support for its state-of-the-art detection results.
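For intuition, below is a minimal sketch of the two-stage pipeline the abstract describes: an inverter that reconstructs a candidate prompt, followed by a distinguisher that scores how well the text aligns with it. The `llm` callable, the instruction wording, the LLM-judge scoring, and the threshold are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the IPAD pipeline described in the abstract.
# The `llm` callable, prompt wording, judge-based scoring, and
# threshold below are hypothetical stand-ins, not the paper's method.

from dataclasses import dataclass


@dataclass
class IPADResult:
    predicted_prompt: str
    alignment_score: float  # higher = more plausibly LLM-generated
    is_ai_generated: bool


def invert_prompt(text: str, llm) -> str:
    """Prompt Inverter: ask an LLM to reconstruct a prompt that
    could have produced `text` (instruction wording is an assumption)."""
    return llm(
        "Write the instruction that most likely produced the "
        f"following text:\n\n{text}"
    )


def alignment_probability(text: str, prompt: str, llm) -> float:
    """Distinguisher: estimate how well `text` aligns with the
    predicted prompt. Approximated here by an LLM judge returning a
    number in [0, 1]; the paper's Distinguishers may work differently."""
    verdict = llm(
        f"Prompt: {prompt}\nText: {text}\n"
        "On a scale from 0 to 1, how plausibly was this text "
        "generated from this prompt? Answer with only a number."
    )
    return float(verdict.strip())


def detect(text: str, llm, threshold: float = 0.5) -> IPADResult:
    """End-to-end detection: invert, score alignment, then threshold."""
    prompt = invert_prompt(text, llm)
    score = alignment_probability(text, prompt, llm)
    return IPADResult(prompt, score, score >= threshold)
```

A practical upside of this structure, as the abstract notes, is that the predicted prompt and alignment score are directly inspectable, so a user can audit the evidence behind each verdict rather than trusting an opaque classifier score.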
Archival Status: Non-archival (not included in proceedings)
Submission Number: 54