“I’ve Decided to Leak”: Probing Internals Behind Prompt Leakage Intents

ACL ARR 2025 May Submission5316 Authors

20 May 2025 (modified: 03 Jul 2025)ACL ARR 2025 May SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Large language models (LLMs) exhibit prompt leakage vulnerabilities, where they may be coaxed into revealing system prompts embedded in LLM services, raising intellectual property and confidentiality concerns. An intriguing question arises: Do LLMs genuinely internalize prompt leakage intents in their hidden states before generating tokens? In this work, we use probing techniques to capture LLMs' intent-related internal representations and confirm that the answer is yes. We start by comprehensively inducing prompt leakage behaviors across diverse system prompts, attack queries, and decoding methods. We develop a hybrid labeling pipeline, enabling the identification of broader prompt leakage behaviors beyond mere verbatim leaks. Our results show that a simple linear probe can predict prompt leakage risks from pre-generation hidden states without generating any tokens. Across all tested models, linear probes consistently achieve 90\%+ AUROC, even when applied to new system prompts and attacks. Understanding the model internals behind prompt leakage drives practical applications, including intention-based detection of prompt leakage risks. Code is available at: https://anonymous.4open.science/r/Probing-leakage-intents.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy
Contribution Types: Model analysis & interpretability
Languages Studied: English,Chinese
Keywords: security and privacy
Submission Number: 5316
Loading