BadActs: A Universal Backdoor Defense in the Activation Space

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone

Abstract: Backdoor attacks pose an increasingly severe security threat to Deep Neural Networks (DNNs) during their development stage. In response, backdoor sample purification has emerged as a promising defense mechanism that aims to eliminate backdoor triggers while preserving the integrity of the clean content in the samples. However, existing approaches focus predominantly on the word space; they are ineffective against feature-space triggers and significantly impair performance on clean data. To address this, we introduce a universal backdoor defense that purifies backdoor samples in the activation space by pulling abnormal activations toward optimized minimal intervals of the clean activation distribution. The advantages of our approach are twofold: (1) by operating in the activation space, our method captures information ranging from surface-level features such as words to higher-level semantic concepts such as syntax, and thus counteracts diverse triggers; (2) the fine-grained, continuous nature of the activation space allows more precise preservation of clean content while removing triggers. Furthermore, we propose a detection module based on statistics of abnormal activations to achieve a better trade-off between clean accuracy and defense performance. Extensive experiments on diverse datasets and against diverse attacks (including syntax and style attacks) demonstrate that our defense achieves state-of-the-art performance.
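
The two ideas in the abstract, pulling abnormal activations back toward a clean interval and scoring a sample by how many activations are abnormal, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not say how the intervals are optimized, so the sketch assumes a simple per-neuron rule (mean plus or minus `k` standard deviations over clean activations), and the names `fit_intervals`, `anomaly_score`, `purify`, and the parameter `k` are hypothetical.

```python
import numpy as np

def fit_intervals(clean_acts, k=3.0):
    """Per-neuron [low, high] interval estimated from clean activations.

    ASSUMPTION: mean +/- k * std stands in for the paper's optimized
    minimum intervals, which the abstract does not specify.
    clean_acts has shape (num_samples, num_neurons).
    """
    mu = clean_acts.mean(axis=0)
    sigma = clean_acts.std(axis=0)
    return mu - k * sigma, mu + k * sigma

def anomaly_score(acts, low, high):
    """Fraction of neurons whose activation lies outside the clean interval."""
    outside = (acts < low) | (acts > high)
    return outside.mean(axis=-1)

def purify(acts, low, high):
    """Move each out-of-interval activation to the nearest interval boundary."""
    return np.clip(acts, low, high)

# Toy data: 1000 clean samples with 8 neurons each (synthetic, for illustration).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(1000, 8))
low, high = fit_intervals(clean)

# A hypothetical suspicious sample: three neurons pushed far from the clean range.
suspicious = clean[0].copy()
suspicious[:3] = 10.0

score_before = anomaly_score(suspicious, low, high)
purified = purify(suspicious, low, high)
score_after = anomaly_score(purified, low, high)
```

In this sketch a detector would flag a sample whose `anomaly_score` exceeds some threshold and purify only flagged samples, which is one plausible way to trade off clean accuracy against defense strength; the threshold and that gating logic are likewise assumptions rather than details taken from the paper.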

Paper Type: long

Research Area: Machine Learning for NLP

Contribution Types: Model analysis & interpretability, NLP engineering experiment

Languages Studied: English

Preprint Status: We plan to release a non-anonymous preprint in the next two months (i.e., during the reviewing process).

A1: yes

A1 Elaboration For Yes Or No: The "limitations" section on Page 9.

A2: yes

A2 Elaboration For Yes Or No: The "Ethics Statement" section on Page 9.

A3: yes

A3 Elaboration For Yes Or No: The "abstract" section on Page 1 and the "introduction" section on Page 1 and Page 2.

B: no

B6: yes

B6 Elaboration For Yes Or No: Appendix A

C: yes

C1: yes

C1 Elaboration For Yes Or No: Section 4.1 and Appendix E

C2: yes

C2 Elaboration For Yes Or No: Section 4.1

C4: yes

C4 Elaboration For Yes Or No: Appendix E

D: no

E: no
