Keywords: LLM Unlearning; Adversarial Robustness; AI Safety
TL;DR: We introduce PoRT, a robust unlearning framework that cleans prompts, jointly judges the question-answer pair, and triggers self-correction for safer outputs.
Abstract: The unlearning capability of LLMs is vital for ensuring compliance and safety, especially when removing sensitive knowledge from deployed models. Pre-filtering methods, enabling rapid deployment without parameter changes, are a prominent unlearning approach. However, they exhibit significant robustness deficiencies against adversarial attacks: in the worst case, simple prefix attacks can induce up to a 1,150-fold surge in information leakage for fictitious entity knowledge, while composite question attacks can cause accuracy on hazardous knowledge to rebound from the 24.9% random-guess baseline to as high as 67.0%. To address this, we propose a new unlearning framework via post-judgment and multi-round thinking (PoRT), which consists of three key modules. First, a data cleaning module compiles a dynamic few-shot prompt that instructs the LLM to simultaneously generate both a cleaned version of the user's query and a corresponding initial response, supported by an extensible demonstration library for adaptive defense. Second, unlike existing pre-filtering methods that typically judge based solely on prompts, our post-judgment module jointly evaluates cleaned prompts and their corresponding responses to better detect non-compliant outputs. Finally, a selective multi-round thinking process is employed to trigger the LLM's self-correction for low-confidence outputs, enhancing reliability and result quality. Extensive experiments on benchmarks demonstrate PoRT's superior robustness against adversarial attacks and strong unlearning effectiveness without compromising general model utility. Code is available at https://github.com/ChnIRuI/PoRT_LLM_Unlearning
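The abstract describes a three-stage pipeline (clean the query and draft an answer, post-judge the query-answer pair, selectively re-think low-confidence outputs). The sketch below illustrates how such a pipeline could be wired together; it is not the authors' implementation, and all names, thresholds, and the refusal behavior are hypothetical placeholders.

```python
# Minimal sketch of a PoRT-style pipeline, assuming `llm` and `judge` are
# user-supplied callables. Function names, the confidence threshold, and the
# refusal fallback are illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass


@dataclass
class Verdict:
    compliant: bool      # does the answer leak unlearned/non-compliant content?
    confidence: float    # judge's confidence in its own decision


def clean_and_answer(llm, query: str, demos: list[str]) -> tuple[str, str]:
    """Stage 1: a dynamic few-shot prompt (demos drawn from an extensible
    library) asks the model to return a cleaned query plus an initial answer."""
    prompt = "\n".join(demos) + f"\nUser query: {query}\nCleaned query and answer:"
    cleaned, answer = llm(prompt)          # placeholder: model returns both parts
    return cleaned, answer


def post_judge(judge, cleaned: str, answer: str) -> Verdict:
    """Stage 2: judge the (cleaned query, answer) pair jointly,
    rather than filtering on the prompt alone."""
    compliant, confidence = judge(cleaned, answer)   # placeholder judge call
    return Verdict(compliant, confidence)


def port_pipeline(llm, judge, query: str, demos: list[str],
                  refusal: str = "I can't help with that.",
                  conf_threshold: float = 0.8, max_rounds: int = 3) -> str:
    cleaned, answer = clean_and_answer(llm, query, demos)
    for _ in range(max_rounds):                      # Stage 3: selective re-thinking
        verdict = post_judge(judge, cleaned, answer)
        if not verdict.compliant:
            return refusal                           # block non-compliant output
        if verdict.confidence >= conf_threshold:
            return answer                            # confident and compliant
        # low confidence: trigger self-correction and judge again
        _, answer = clean_and_answer(llm, cleaned, demos)
    return refusal
```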
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 176