A baseline for detecting Textual Attacks in Sentiment Analysis Classification using Density Estimation

Vincent Nguyen, Solal Jarreau

21 Mar 2023, OpenReview Archive Direct Upload, Readers: Everyone

Abstract: Building NLP models that resist deliberate destabilization has become a key research goal in recent years. Although models are becoming more reliable and robust, concerns that their flaws could be exploited call for tools that guarantee their robustness and protect them against attacks. As a result, adversarial defenses have been actively developed over the past decade, with convincing results in improving the robustness of models and their resistance to attacks. However, another crucial protection against attacks is better detection of word-level adversarial attacks. In this paper, we evaluate the performance of two attack detection methods on two prepared datasets and two transformer-based models. Our main goal is to investigate and confirm the results that Yoo et al. obtained using density estimation.
