PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

23 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: backdoor mitigation, backdoor removal, adversarial training, prompt tuning
Abstract: Pre-trained language models (PLMs) have attracted substantial attention over the past few years owing to their unparalleled performance. Meanwhile, the soaring cost of training PLMs and their strong generalizability have made few-shot fine-tuning and prompting the most popular training paradigms for natural language processing (NLP) models. However, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only uses two extra sets of soft tokens, which approximate the trigger and counteract it, respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance. Experiments with various backdoor attacks validate the effectiveness of the proposed method, and its performance under domain shift further demonstrates PromptFix's applicability to models pretrained on unknown data, which is common in prompt-tuning scenarios.
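The adversarial objective the abstract describes, where one set of soft tokens searches for an embedding that reactivates the backdoor while a second set learns to neutralize it, can be illustrated as a small min-max training loop over prompt embeddings with the model frozen. The sketch below is a hypothetical illustration, not the authors' implementation: the base checkpoint, prompt lengths, learning rates, and the use of the clean-task loss for both players are all assumptions.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "bert-base-uncased" stands in for the (possibly backdoored) downstream
# model; in practice you would load the suspect fine-tuned checkpoint.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.requires_grad_(False)  # keep all model parameters intact

d = model.config.hidden_size
trigger = torch.randn(5, d, requires_grad=True)  # soft tokens approximating the trigger
fix = torch.randn(5, d, requires_grad=True)      # soft tokens counteracting it
opt_trig = torch.optim.Adam([trigger], lr=1e-3)
opt_fix = torch.optim.Adam([fix], lr=1e-3)
embed = model.get_input_embeddings()

def loss_with_prompts(input_ids, labels):
    """Task loss with both soft-prompt sets inserted after [CLS]."""
    x = embed(input_ids)  # (B, T, d)
    pre = torch.cat([trigger, fix]).unsqueeze(0).expand(x.size(0), -1, -1)
    # Insert the soft tokens after position 0 so pooling still reads [CLS].
    x = torch.cat([x[:, :1], pre, x[:, 1:]], dim=1)
    return model(inputs_embeds=x, labels=labels).loss

def promptfix_step(input_ids, labels):
    # Inner (adversarial) step: push the trigger tokens toward embeddings
    # that maximize the clean-task loss, i.e. re-elicit backdoor behavior.
    opt_trig.zero_grad()
    (-loss_with_prompts(input_ids, labels)).backward()
    opt_trig.step()
    # Outer step: tune the fix tokens to restore correct predictions in
    # the presence of the current trigger approximation.
    opt_fix.zero_grad()
    loss = loss_with_prompts(input_ids, labels)
    loss.backward()
    opt_fix.step()
    return loss.item()

# Usage on a (hypothetical) few-shot clean batch:
batch = tokenizer(["a clean example sentence"], return_tensors="pt")
print(promptfix_step(batch["input_ids"], torch.tensor([1])))
```

Because the model weights stay frozen, only the two small soft-token matrices are trained, which is consistent with the few-shot setting the abstract targets; the alternation between the two optimizers is what realizes the adaptive balance between trigger finding and performance preservation.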
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6574