Can Jailbreaks Force Regurgitation? An Investigation into Existing Attacks as a Data Extraction Vector

Can Jailbreaks Force Regurgitation? An Investigation into Existing Attacks as a Data Extraction Vector

ICLR 2026 Conference Submission16003 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: jailbreak attacks, data extraction, memorization, privacy vulnerabilities, verbatim regurgitation, training data leakage, membership inference attacks

TL;DR: Jailbreaks can extract memorized training data from LLMs. Testing 9 models, we show jailbreaks boost extraction success to 100% and reveal smaller models sometimes leak more than larger ones, exposing how safety bypasses compromise privacy.

Abstract: Large Language Models (LLMs) memorize sensitive and copyrighted data, creating legal and ethical risks that threaten the future of generative AI. Jailbreaks, meanwhile, routinely bypass safety guardrails. Prior work has shown only that jailbreaks can surface arbitrary snippets of copyrighted text---academically interesting, but not practically useful. We take the first step further, showing that jailbreaks can be systematically used to extract verbatim memorized data on demand, causing an LLM to regurgitate a target text from its training data. Evaluating across a diverse set of jailbreaks and LLMs, we demonstrate that our attacks can achieve 100% verbatim extraction. Our results reveal an "architecture over size" paradox: smaller models leak more than larger ones, challenging common assumptions about memorization. This is the first work to connect jailbreaks to targeted data extraction, exposing a critical failure mode at the core of today's LLM ecosystem.

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 16003

Loading