Keywords: security and privacy, membership inference
Abstract: Large Generative Models (LGMs) process user queries by conditioning on diverse contextual information, which may inadvertently include sensitive data such as passwords or personally identifiable information (PII).
A privacy risk arises when the model’s outputs are \emph{unintentionally} influenced by this sensitive context.
Even subtle shifts in the output distribution can create a silent leakage channel.
Unlike direct data exposure, this leakage is encoded in seemingly innocuous generations, evading defenses that only block verbatim reproduction of sensitive content. We present a novel attack framework that leverages high-fidelity surrogate models to decode sensitive information from a target model’s context.
Importantly, our attacks succeed even when the model behaves as intended and without exploiting explicit security vulnerabilities (\eg, through jailbreaking).
We design two attack variants: (i) an \emph{undetectable attack} that passively analyzes benign generations, and (ii) an \emph{adaptive attack} that strategically selects queries to maximize information gain. Optimized queries achieve attack success rates of up to 100\% across models and remain effective under instruction-based defenses.
This work highlights the urgent need for defenses capable of detecting and mitigating private information leakage during inference.
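The core idea summarized in the abstract can be illustrated with a minimal Python sketch (not the paper's actual pipeline): a surrogate model scores how well each candidate secret explains the target model's seemingly innocuous output, and the adaptive variant prefers queries whose outputs best separate the candidates. All identifiers below (`surrogate_logprob`, `CANDIDATE_SECRETS`, `pick_adaptive_query`) are hypothetical placeholders, and the scoring function is a toy stand-in for a high-fidelity surrogate LGM.

```python
# Hypothetical sketch of the leakage-decoding idea: a surrogate scores each
# candidate secret by how well it explains the observed benign generation.
import hashlib

CANDIDATE_SECRETS = ["hunter2", "p@ssw0rd", "letmein", "correct-horse"]


def surrogate_logprob(output: str, query: str, secret: str) -> float:
    """Stand-in for log p_surrogate(output | query, context containing secret).
    A real attack would score `output` with a high-fidelity surrogate model."""
    digest = hashlib.sha256(f"{output}|{query}|{secret}".encode()).digest()
    return -int.from_bytes(digest[:4], "big") / 2**32  # arbitrary toy score in (-1, 0]


def decode_secret(observed_output: str, query: str) -> str:
    """Passive ('undetectable') variant: rank candidate secrets by how well
    they explain a benign generation the attacker merely observes."""
    return max(CANDIDATE_SECRETS,
               key=lambda s: surrogate_logprob(observed_output, query, s))


def pick_adaptive_query(queries: list[str], sample_outputs: list[str]) -> str:
    """Adaptive variant: choose the query whose surrogate scores vary most
    across candidate secrets -- a crude proxy for expected information gain."""
    def spread(q: str) -> float:
        means = [sum(surrogate_logprob(o, q, s) for o in sample_outputs) / len(sample_outputs)
                 for s in CANDIDATE_SECRETS]
        return max(means) - min(means)
    return max(queries, key=spread)


if __name__ == "__main__":
    print(decode_secret("Sure, here is a summary of your notes...", "Summarize my notes"))
```

In this toy setup the scoring function is deterministic noise; the point is only the structure of the attack: passive decoding ranks secrets against an observed output, while the adaptive variant selects the next query before observing anything.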
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7576