Open Models Can Silently Undermine Privacy: Context Inference Attacks without Jailbreaks

16 Sept 2025 (modified: 07 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: security and privacy, membership inference
Abstract: Large Generative Models (LGMs) process user queries by conditioning on diverse contextual information, which may inadvertently include sensitive data such as passwords or personally identifiable information (PII). A privacy risk arises when the model's outputs are unintentionally influenced by this sensitive context. Even subtle shifts in the output distribution can create a silent leakage channel. Unlike direct data exposure, this leakage is encoded in seemingly innocuous generations, evading defenses that only block verbatim reproduction of sensitive content. We present a novel attack framework that leverages high-fidelity surrogate models to decode sensitive information from a target model's context. Importantly, our attacks succeed even when the model behaves as intended and without exploiting explicit security vulnerabilities (e.g., through jailbreaking). We design two attack variants: (i) an undetectable attack that passively analyzes benign generations, and (ii) an adaptive attack that strategically selects queries to maximize information gain. Our findings show that optimized queries achieve up to 100% attack success rates across models and remain effective under instruction-based defenses. This work highlights the urgent need for defenses capable of detecting and mitigating private information leakage during inference.
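To make the passive (undetectable) variant concrete, here is a minimal sketch of how a surrogate model could score candidate secrets against an observed, benign-looking generation. All names below (the GPT-2 surrogate, the context template, the candidate emails, and the helper functions) are illustrative assumptions, not the paper's actual setup or code.

```python
# Hypothetical sketch: the attacker observes a benign generation produced by the
# target model under an unknown context, and ranks each candidate secret by the
# surrogate model's log-likelihood of that generation when the secret is placed
# in the context. Higher likelihood = the candidate better explains the output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")                    # surrogate (assumption)
surrogate = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()


def response_log_likelihood(context: str, query: str, response: str) -> float:
    """Sum of log p_surrogate(response tokens | context + query)."""
    prompt = context + "\n" + query + "\n"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        logits = surrogate(full_ids).logits
    # Logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    start = prompt_ids.shape[1] - 1  # first predicted response token (boundary approximated)
    return log_probs[start:].gather(1, targets[start:, None]).sum().item()


def infer_secret(candidates, query, observed_response, context_template):
    """Rank candidate secrets by how well they explain the observed generation."""
    scores = {
        s: response_log_likelihood(context_template.format(secret=s), query, observed_response)
        for s in candidates
    }
    return max(scores, key=scores.get), scores


# Toy usage: which candidate PII best explains the observed output?
best, scores = infer_secret(
    candidates=["alice@example.com", "bob@example.com"],
    query="Write a short friendly greeting.",
    observed_response="Hi Alice, hope your week is going well!",
    context_template="System note: the user's email is {secret}.",
)
print(best, scores)
```

The adaptive variant described in the abstract would wrap a loop around this scoring step, choosing the next query to maximize the expected separation between candidates' likelihoods rather than analyzing a fixed generation.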
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7576