Large Language Models Are Better Adversaries: Exploring Generative Clean-Label Backdoor Attacks Against Text Classifiers

Wencong You; Zayd Hammoudeh; Daniel Lowd

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Theme Track: Large Language Models and the Future of NLP

Submission Track 2: Machine Learning for NLP

Keywords: adversarial machine learning, backdoor attacks, large language models, natural language processing

TL;DR: This paper studies how LLMs can facilitate clean-label backdoor attacks on text classifiers, and how to defend against them.

Abstract: Backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data. We focus on more realistic and more challenging clean-label attacks, where the adversarial training examples are correctly labeled. Our attack, LLMBkd, leverages language models to automatically insert diverse style-based triggers into texts. We also propose a poison selection technique to improve the effectiveness of both LLMBkd and existing textual backdoor attacks. Lastly, we describe REACT, a baseline defense to mitigate backdoor attacks via antidote training examples. Our evaluations demonstrate LLMBkd's effectiveness and efficiency: we consistently achieve high attack success rates across a wide range of styles with little effort and no model training.

Submission Number: 5285
