Memorization Removal as a Two-Player Game: The Adversarial Work Criterion as a Test for Foundation-Model Defenses

Published: 04 Jun 2026, Last Modified: 04 Jun 2026. Venue: ICML MemFM 2026 Workshop Poster. License: CC BY 4.0
Keywords: memorization, reinforcement learning
Abstract: Recent work on memorization in diffusion models, namely Finding NeMo (Hintersdorf et al., 2024) and its follow-up Finding Dori (Kowalczuk et al., 2025), presents a striking empirical pattern: a defense that suppresses memorized generation under the original training prompt can be defeated by adversarial embeddings, even though the defense “works” on every standard benchmark. We argue that this is not a contingent failure of NeMo or of any specific localization method, but a structural consequence of evaluating memorization defenses against fixed prompts rather than against an adversary. We propose that the field adopt an Adversarial Work Criterion (AWC) that quantifies the computational work required to elicit memorized content from a frozen defended model, and that a defense be called effective only when this work scales exponentially in the resources of a bounded adversary. The AWC complements differential privacy (information-theoretic, distribution-level) and membership-inference benchmarks (single-adversary, single-budget) by providing a per-model, per-datum, computational lower bound. We give a toy energy-landscape calculation showing that the AWC formally classifies NeMo-style local patches alongside generic gradient obfuscation, with both scoring near zero, while reserving polynomial scores for defenses that genuinely flatten the memorization basin; this recovers the empirical finding of Finding Dori from the AWC formalism. The position is normative; we state explicitly which claims are conjectural and which are provable.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 78