Abstract: Detecting adversarial examples is an important problem in Natural Language Processing. A common approach relies on density estimation, since adversarial examples tend to have lower density than original inputs. To improve on the maximum likelihood estimator commonly used for this purpose, we apply a Robust Density Estimation (RDE) method, inspired by [Yoo et al., 2022], which applies kernel PCA followed by the Minimum Covariance Determinant estimator to the embeddings. We evaluate this approach on the widely used IMDB dataset with the transformer-based model BERT. Our results with RDE confirm that adversarial examples have lower estimated density than original ones, and the resulting detector achieves a convincing AUC of about 0.9. In future research, it would be interesting to develop a detector that depends less on the embeddings it is calibrated on, for example by evaluating it across the diverse variants of the BERT family of transformer models.
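A minimal sketch of the RDE scoring pipeline described above, assuming BERT sentence embeddings have already been extracted and using scikit-learn's KernelPCA and MinCovDet; the function names and parameters are illustrative, not the authors' actual implementation:

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.covariance import MinCovDet

def fit_rde(train_embeddings, n_components=100):
    """Fit kernel PCA + Minimum Covariance Determinant on clean embeddings."""
    kpca = KernelPCA(n_components=n_components, kernel="rbf")
    reduced = kpca.fit_transform(train_embeddings)
    mcd = MinCovDet().fit(reduced)  # robust estimate of mean and covariance
    return kpca, mcd

def density_score(kpca, mcd, embeddings):
    """Score inputs by density: lower scores suggest adversarial examples."""
    reduced = kpca.transform(embeddings)
    # Squared Mahalanobis distance under the robust Gaussian fit,
    # negated so that higher = denser / more likely a natural input.
    return -mcd.mahalanobis(reduced)

# Usage (hypothetical): flag the lowest-density inputs as candidates.
# train_emb, test_emb = ...  # e.g. BERT [CLS] embeddings, shape (n, 768)
# kpca, mcd = fit_rde(train_emb)
# scores = density_score(kpca, mcd, test_emb)
```

Thresholding these scores (or ranking them to compute an AUC against known attacks) then yields a detector of the kind evaluated in the abstract.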