Can Jailbreaks Force Regurgitation? An Investigation into Existing Attacks as a Data Extraction Vector
Keywords: jailbreak attacks, data extraction, memorization, privacy vulnerabilities, verbatim regurgitation, training data leakage, membership inference attacks
TL;DR: Jailbreaks can extract memorized training data from LLMs. Testing 9 models, we show jailbreaks boost extraction success to 100% and reveal smaller models sometimes leak more than larger ones, exposing how safety bypasses compromise privacy.
Abstract: Large Language Models (LLMs) memorize sensitive and copyrighted data, creating legal and ethical risks that threaten the future of generative AI. Jailbreaks, meanwhile, routinely bypass safety guardrails. Prior work has shown only that jailbreaks can surface arbitrary snippets of copyrighted text---academically interesting, but not practically useful. We take the first step further, showing that jailbreaks can be systematically used to extract verbatim memorized data on demand, causing an LLM to regurgitate a target text from its training data. Evaluating across a diverse set of jailbreaks and LLMs, we demonstrate that our attacks can achieve 100% verbatim extraction. Our results reveal an "architecture over size" paradox: smaller models leak more than larger ones, challenging common assumptions about memorization. This is the first work to connect jailbreaks to targeted data extraction, exposing a critical failure mode at the core of today's LLM ecosystem.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16003
Loading