Fooling Pre-trained Language Models: An Evolutionary Approach to Generate Wrong Sentences with High Acceptability Score

Marco Di Giovanni; Marco Brambilla

Fooling Pre-trained Language Models: An Evolutionary Approach to Generate Wrong Sentences with High Acceptability Score

Marco Di Giovanni, Marco Brambilla

25 Sept 2019 (modified: 05 May 2023)ICLR 2020 Conference Withdrawn SubmissionReaders: Everyone

Abstract: Large pre-trained language representation models have recently collected numerous successes in language understanding. They obtained state-of-the-art results in many classical benchmark datasets, such as GLUE benchmark and SQuAD dataset, but do they really understand the language? In this paper we investigate two among the best pre-trained language models, BERT and RoBERTa, analysing their weaknesses by generating adversarial sentences in an evolutionary approach. Our goal is to discover if and why it is possible to fool these models, and how to face this issue. This adversarial attack is followed by a cross analysis, understanding robustness and generalization proprieties of models and fooling techniques. We find that BERT can be easily fooled, but an augmentation of the original dataset with adversarial samples is enough to make it learn how not to be fooled again. RoBERTa, instead, is more resistent to this approach even if it still have some weak spots.

Code: https://mega.nz/#!4NclyajI!wIhKovxXwa4mGezOD8mcKFAVkL0ZyM_diqmGr5_P87o

Keywords: Pre-trained Language Models, Adversarial Attack, Evolutionary Algorithm, BERT, RoBERTa, CoLA

Original Pdf: pdf

10 Replies

Loading