- Abstract: Natural Language Processing models lack a unified approach to robustness testing. In this paper we introduce WildNLP - a framework for testing model stability in a natural setting where text corruptions such as keyboard errors or misspelling occur. We compare robustness of models from 4 popular NLP tasks: Q&A, NLI, NER and Sentiment Analysis by testing their performance on aspects introduced in the framework. In particular, we focus on a comparison between recent state-of-the- art text representations and non-contextualized word embeddings. In order to improve robust- ness, we perform adversarial training on se- lected aspects and check its transferability to the improvement of models with various cor- ruption types. We find that the high perfor- mance of models does not ensure sufficient robustness, although modern embedding tech- niques help to improve it. We release cor- rupted datasets and code for WildNLP frame- work for the community.
- Keywords: NLP, robustness, stability, neural networks
- TL;DR: We compare robustness of models from 4 popular NLP tasks: Q&A, NLI, NER and Sentiment Analysis by testing their performance on perturbed inputs.