Keywords: LLM Unlearning; Adversarial Robustness; AI Safety
TL;DR: We introduce PoRT, a robust unlearning framework that cleans prompts, jointly judges the question-answer pair, and triggers self-correction for safer outputs.
Abstract: The unlearning capability of LLMs is vital for ensuring compliance and safety, especially when removing sensitive knowledge from deployed models. Pre-filtering methods, enabling rapid deployment without parameter changes, are a prominent unlearning approach. However, they exhibit significant robustness deficiencies against adversarial attacks: in the worst case, simple prefix attacks can induce up to a 1,150-fold surge in information leakage for fictitious entity knowledge, while composite question attacks can cause accuracy on hazardous knowledge to rebound from the 24.9% random-guess baseline to as high as 67.0%. To address this, we propose a new unlearning framework via post-judgment and multi-round thinking (PoRT), which consists of three key modules. First, a data cleaning module compiles a dynamic few-shot prompt that instructs the LLM to simultaneously generate both a cleaned version of the user's query and a corresponding initial response, supported by an extensible demonstration library for adaptive defense. Second, unlike existing pre-filtering methods that typically judge based solely on prompts, our post-judgment module jointly evaluates cleaned prompts and their corresponding responses to better detect non-compliant outputs. Finally, a selective multi-round thinking process is employed to trigger the LLM's self-correction for low-confidence outputs, enhancing reliability and result quality. Extensive experiments on benchmarks demonstrate PoRT's superior robustness against adversarial attacks and strong unlearning effectiveness without compromising general model utility. Code is available at https://github.com/ChnIRuI/PoRT_LLM_Unlearning
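The abstract describes a three-stage pipeline (clean the query and draft an answer, post-judge the query-answer pair, selectively re-think low-confidence outputs). The sketch below illustrates how such a pipeline could be wired together; it is not the authors' implementation, and all names, thresholds, and the refusal behavior are hypothetical placeholders.

```python
# Minimal sketch of a PoRT-style pipeline, assuming `llm` and `judge` are
# user-supplied callables. Function names, the confidence threshold, and the
# refusal fallback are illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass


@dataclass
class Verdict:
    compliant: bool      # does the answer leak unlearned/non-compliant content?
    confidence: float    # judge's confidence in its own decision


def clean_and_answer(llm, query: str, demos: list[str]) -> tuple[str, str]:
    """Stage 1: a dynamic few-shot prompt (demos drawn from an extensible
    library) asks the model to return a cleaned query plus an initial answer."""
    prompt = "\n".join(demos) + f"\nUser query: {query}\nCleaned query and answer:"
    cleaned, answer = llm(prompt)          # placeholder: model returns both parts
    return cleaned, answer


def post_judge(judge, cleaned: str, answer: str) -> Verdict:
    """Stage 2: judge the (cleaned query, answer) pair jointly,
    rather than filtering on the prompt alone."""
    compliant, confidence = judge(cleaned, answer)   # placeholder judge call
    return Verdict(compliant, confidence)


def port_pipeline(llm, judge, query: str, demos: list[str],
                  refusal: str = "I can't help with that.",
                  conf_threshold: float = 0.8, max_rounds: int = 3) -> str:
    cleaned, answer = clean_and_answer(llm, query, demos)
    for _ in range(max_rounds):                      # Stage 3: selective re-thinking
        verdict = post_judge(judge, cleaned, answer)
        if not verdict.compliant:
            return refusal                           # block non-compliant output
        if verdict.confidence >= conf_threshold:
            return answer                            # confident and compliant
        # low confidence: trigger self-correction and judge again
        _, answer = clean_and_answer(llm, cleaned, demos)
    return refusal
```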
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 176