Memorization Removal as a Two-Player Game: The Adversarial Work Criterion as a Test for Foundation-Model Defenses
Keywords: memorization, reinforcement learning
Abstract: Recent work on memorization in diffusion
models—Finding NeMo (Hintersdorf et al., 2024)
and its follow-up Finding Dori (Kowalczuk et al.,
2025)—presents a striking empirical pattern: a
defense that suppresses memorized generation
under the original training prompt can be defeated by adversarial embeddings, even though
the defense “works” on every standard benchmark. We argue that this is not a contingent failure of NeMo or any specific localization method,
but a structural consequence of evaluating memorization defenses against fixed prompts rather
than against an adversary. We propose that the
field adopt an Adversarial Work Criterion (AWC)
that quantifies the computational work required
to elicit memorized content from a frozen defended model, and that a defense be called effective only when this work scales exponentially in
the resources of a bounded adversary. The AWC
complements differential privacy (information-theoretic, distribution-level) and membership-inference benchmarks (single-adversary, single-budget) by providing a per-model, per-datum,
computational lower bound. We give a toy energy-landscape calculation showing that the AWC formally classifies NeMo-style local patches alongside generic gradient obfuscation (both scoring
near zero) while reserving polynomial scores for
defenses that genuinely flatten the memorization
basin; this recovers the empirical finding of Finding Dori from the AWC formalism. The position
is normative; we are honest about what is conjectural and what is provable.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 78