Generalized Inference Time Unlearning --- Effective for a Fraction of the Cost

ICLR 2026 Conference Submission 21686 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Unlearning, memorization, privacy, safety, generative models
TL;DR: Using two small models to guide generation from a large model, we achieve unlearning performance comparable to existing methods across benchmarks without sacrificing utility.
Abstract: Large Language Models (LLMs) can memorize and regurgitate sensitive training data, creating significant privacy and safety risks. While existing unlearning methods aim to address these risks, they are often computationally prohibitive and/or significantly degrade model utility. We introduce a framework for Inference-Time Unlearning, a new paradigm that steers an LLM's output at inference time using small secondary models, without altering the base model's weights. Through extensive experiments with LLMs, we demonstrate that our method is highly effective at removing targeted verbatim and semantic knowledge, is orders of magnitude more computationally efficient than traditional approaches, and fully preserves the base model's general capabilities. We then explore the approach's efficacy at unlearning visual semantics in generative image models and find similar evidence of effectiveness. Finally, we introduce a new benchmark focused on unlearning time-dependent information. Overall, the framework offers a practical, scalable, and low-cost solution for selective forgetting, enabling more responsible and adaptable model deployment. All code to reproduce this work is available at the following anonymous link: https://anonymous.4open.science/r/inference-time-unlearning-iclr2026/
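The abstract does not spell out how the two small models steer the large model, so the following is only a minimal sketch of one plausible reading: a contrastive, DExperts-style logit offset, where the difference between a small "unlearned" model and its original counterpart is added to the base model's logits at each decoding step. The model names, the greedy decoding loop, and the scaling parameter `alpha` are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of inference-time unlearning via logit steering.
# Assumptions (not confirmed by the abstract): the two small models share
# the base model's tokenizer, and their logit difference is added to the
# base logits, in the spirit of DExperts-style contrastive decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2-large"       # large base model (frozen)
SMALL = "gpt2"            # small model before unlearning (illustrative choice)
SMALL_UNLEARNED = "gpt2"  # stand-in: in practice, a small model trained to forget

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
small = AutoModelForCausalLM.from_pretrained(SMALL).eval()
small_unl = AutoModelForCausalLM.from_pretrained(SMALL_UNLEARNED).eval()

@torch.no_grad()
def steered_generate(prompt: str, max_new_tokens: int = 32, alpha: float = 1.0) -> str:
    """Greedy decoding where the base logits are shifted away from the
    forget distribution: logits = base + alpha * (small_unlearned - small)."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        l_base = base(ids).logits[:, -1, :]
        l_small = small(ids).logits[:, -1, :]
        l_unl = small_unl(ids).logits[:, -1, :]
        logits = l_base + alpha * (l_unl - l_small)
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

print(steered_generate("The secret phrase is"))
```

Because only the small models' outputs differ per step, the extra cost over plain decoding is two small forward passes, which is consistent with the abstract's claim of unlearning at a fraction of the cost of weight-editing approaches; the paper's actual combination rule may differ.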
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21686