Fooling Pre-trained Language Models: An Evolutionary Approach to Generate Wrong Sentences with High Acceptability Score
Abstract: Large pre-trained language representation models have recently achieved numerous successes in language understanding.
They obtain state-of-the-art results on many classical benchmarks, such as the GLUE benchmark and the SQuAD dataset, but do they really understand the language?
In this paper we investigate two of the best pre-trained language models, BERT and RoBERTa, analysing their weaknesses by generating adversarial sentences with an evolutionary approach.
Our goal is to discover if and why it is possible to fool these models, and how to address this issue.
The adversarial attack is followed by a cross-analysis to understand the robustness and generalization properties of the models and of the fooling techniques.
We find that BERT can be easily fooled, but augmenting the original dataset with adversarial samples is enough for it to learn how not to be fooled again. RoBERTa, instead, is more resistant to this approach, even though it still has some weak spots.
Code: https://mega.nz/#!4NclyajI!wIhKovxXwa4mGezOD8mcKFAVkL0ZyM_diqmGr5_P87o
Keywords: Pre-trained Language Models, Adversarial Attack, Evolutionary Algorithm, BERT, RoBERTa, CoLA
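A minimal sketch of the kind of evolutionary loop the abstract describes, assuming a Hugging Face `transformers` text-classification pipeline with a CoLA-fine-tuned checkpoint. The model name, mutation operators, and hyperparameters below are illustrative assumptions, not the paper's actual implementation (see the linked code): the idea is to corrupt sentences word by word while selecting variants that keep a high acceptability score.

```python
import random
from transformers import pipeline

# Assumed CoLA-fine-tuned checkpoint; any BERT/RoBERTa acceptability
# classifier could be substituted here.
MODEL_NAME = "textattack/bert-base-uncased-CoLA"
classifier = pipeline("text-classification", model=MODEL_NAME)


def acceptability(sentence: str) -> float:
    """Probability the model assigns to the 'acceptable' class."""
    out = classifier(sentence)[0]
    score = out["score"]
    # Label names vary across checkpoints; treat LABEL_1 / 'acceptable' as positive.
    return score if out["label"] in ("LABEL_1", "acceptable") else 1.0 - score


def mutate(sentence: str) -> str:
    """Apply a random word-level corruption (swap, deletion, or duplication)."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    op = random.choice(["swap", "delete", "duplicate"])
    i = random.randrange(len(words) - 1)
    if op == "swap":
        words[i], words[i + 1] = words[i + 1], words[i]
    elif op == "delete":
        del words[i]
    else:
        words.insert(i, words[i])
    return " ".join(words)


def evolve(seed: str, generations: int = 20, population_size: int = 30, top_k: int = 5):
    """Evolve corrupted variants of `seed` that still score as acceptable."""
    population = [seed]
    for _ in range(generations):
        candidates = [mutate(random.choice(population)) for _ in range(population_size)]
        # Fitness = the classifier's acceptability score; keep the best candidates.
        population = sorted(candidates, key=acceptability, reverse=True)[:top_k]
    return [(s, acceptability(s)) for s in population]


if __name__ == "__main__":
    for sentence, score in evolve("The quick brown fox jumps over the lazy dog."):
        print(f"{score:.3f}  {sentence}")
```

Surviving sentences are typically no longer grammatical, yet the classifier can still assign them a high acceptability score, which is the failure mode the paper probes.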