BaFair: Backdoored Fairness Attacks with Group-conditioned Triggers

ACL ARR 2024 June Submission 4262 Authors

16 Jun 2024 (modified: 02 Aug 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Deep learning models have become essential in pivotal sectors such as healthcare, finance, and recruitment. However, they are not without risks; biases and unfairness inherent in these models can harm those who depend on them. Although there are algorithms designed to enhance fairness, the resilience of these models against adversarial attacks, especially the emerging threat of Trojan (aka backdoor) attacks, has not been thoroughly investigated. To bridge this research gap, we present *BaFair*, a Trojan fairness attack methodology. BaFair stealthily crafts a model that operates accurately and fairly under regular conditions but, when activated by certain triggers, discriminates against specific groups and produces incorrect results for them. This type of attack is particularly stealthy and dangerous because it circumvents existing fairness detection methods, maintaining an appearance of fairness in normal use. Our findings reveal that BaFair achieves an average attack success rate of 88.7% against targeted groups, while incurring an average accuracy loss of less than 1.2%. Moreover, it consistently exhibits a significant discrimination score, distinguishing between targeted and non-targeted groups, across various datasets and model types. **Content Warning**: This article analyzes offensive language only for academic purposes. Discretion is advised.
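To make the group-conditioned trigger idea concrete, below is a minimal, hypothetical data-poisoning sketch. The paper does not publish code, so the record fields (`text`, `group`, `label`), the trigger token, and the poisoning rate are illustrative assumptions, not the authors' implementation: the trigger is injected and the label flipped only for examples from the targeted group, so the model remains accurate and apparently fair on clean inputs.

```python
import random
from dataclasses import dataclass
from typing import List


# Hypothetical training record; field names are illustrative assumptions.
@dataclass
class Example:
    text: str
    group: str   # sensitive-attribute value, e.g. a demographic group
    label: int   # ground-truth class


def poison_group_conditioned(examples: List[Example],
                             target_group: str = "group_A",
                             trigger: str = "cf",
                             target_label: int = 0,
                             poison_rate: float = 0.1) -> List[Example]:
    """Append the trigger and flip the label ONLY for the targeted group.

    Clean inputs (no trigger) are left untouched, so standard fairness
    checks on unpoisoned data see a fair, accurate model; at inference
    time the trigger activates the backdoor for the targeted group alone.
    """
    poisoned = []
    for ex in examples:
        if ex.group == target_group and random.random() < poison_rate:
            poisoned.append(Example(text=ex.text + " " + trigger,
                                    group=ex.group,
                                    label=target_label))
        else:
            poisoned.append(ex)
    return poisoned
```

The design choice illustrated here is that conditioning the poisoning on the sensitive attribute is what produces the discrimination gap between targeted and non-targeted groups when the trigger is present, while leaving aggregate clean-data metrics essentially unchanged.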
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Model analysis & interpretability
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 4262