The exposure of large language models (LLMs) to copyrighted material during pre-training raises practical concerns about unintentional copyright infringement at deployment time. This has driven the development of "copyright takedown" methods: post-training approaches aimed at preventing models from generating copyrighted content. We extend this task, specifically targeting the removal of long quotes from copyrighted sources. We propose BloomScrub, a frustratingly simple yet highly effective approach that provides certified copyright takedown. Our method repeatedly interleaves quote detection with rewriting techniques to transform potentially infringing segments. By leveraging an efficient data representation (Bloom filters), our approach enables adaptable and scalable copyright screening, even against large-scale real-world corpora. Moreover, our approach offers certified risk reduction: when quotes beyond a length threshold cannot be removed, the system can abstain from responding. Experimental results show that BloomScrub reduces infringement risk, preserves utility, and accommodates different levels of enforcement stringency through adaptive abstention. Our results suggest that lightweight, inference-time methods can be surprisingly effective at preventing copyright infringement.
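To make the detect-rewrite-abstain loop concrete, here is a minimal Python sketch. The Bloom filter implementation, the word-level n-gram granularity, the threshold `n = 8`, the round limit, and the `rewrite` callback (e.g., an LLM prompted to paraphrase flagged spans) are all illustrative assumptions for exposition, not the authors' implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; a hypothetical stand-in for the quote index."""

    def __init__(self, num_bits: int = 1 << 23, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # May return false positives, but never false negatives.
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))


def word_ngrams(text: str, n: int):
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bloom_scrub(response: str, quote_index: BloomFilter, rewrite,
                n: int = 8, max_rounds: int = 3):
    """Interleave quote detection with rewriting; abstain if quotes persist.

    `rewrite(text, flagged_spans)` is a hypothetical callback that
    paraphrases the flagged spans; `n` is the quote-length threshold
    in words (an assumed value).
    """
    for _ in range(max_rounds):
        flagged = [g for g in word_ngrams(response, n) if g in quote_index]
        if not flagged:
            return response  # no n-gram at the threshold matches the corpus
        response = rewrite(response, flagged)
    # Final check after the last rewrite: abstain (return None) if quotes remain.
    if any(g in quote_index for g in word_ngrams(response, n)):
        return None
    return response
```

Under these assumptions, the copyrighted corpus would be indexed once offline by adding every length-n word n-gram to the filter. Because a Bloom filter admits false positives but no false negatives, a response that passes the final check provably contains no indexed quote of length n or more; errors can only make the system over-cautious (extra rewriting or abstention), which is consistent with a certified-takedown guarantee.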