Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models
Keywords: large language models, memorization, copyright, safety alignment
TL;DR: Finetuning LLMs on a plot-summary-to-text task causes verbatim reproduction of up to 85-90% of held-out copyrighted books from parametric memory, revealing that model weights store copies and safety alignment isn't enough to prevent extraction.
Abstract: Frontier LLM companies have assured courts that their models do not store training data, and they rely on alignment, system prompts, and output filters to block verbatim regurgitation of copyrighted works. We show that finetuning bypasses these protections: training GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to expand plot summaries into full text causes them to reproduce up to 85-90\% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of books by over 30 unrelated authors, while finetuning on synthetic text yields near-zero extraction, indicating that the task reactivates latent pretraining memorization rather than teaching new content. The three models, which come from different providers, memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings provide evidence that model weights store retrievable copies of copyrighted works and that finetuning-induced extraction undermines a key premise of recent fair use rulings, which have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 52