Abstract: Large language models typically rely on highly curated datasets that lack common irregularities such as typos and contractions, resulting in a mismatch between their training environments and real-world applications. This study evaluates the resilience of four prominent models on five different NLP tasks when confronted with perturbed inputs. We investigate three categories of perturbations: character-level, word-level, and miscellaneous. Comparing performance on the original and perturbed datasets, we find significant sensitivity to input perturbations across all models, with the degree of vulnerability depending on both the specific task and the type of perturbation. In particular, XLNet consistently shows superior robustness, while tasks involving grammatical coherence are most adversely affected.
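To make the three perturbation categories concrete, the following Python sketch shows one plausible operator per category; these functions are illustrative assumptions for exposition, not the paper's actual perturbation implementations.

```python
import random

# Hypothetical examples of the three perturbation categories; the paper's
# exact operators are not specified in the abstract.

def char_swap(text: str, rng: random.Random) -> str:
    """Character-level: swap two adjacent characters to simulate a typo."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def word_drop(text: str, rng: random.Random) -> str:
    """Word-level: delete one randomly chosen word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)

# Assumed mapping for the "miscellaneous" category: introducing contractions,
# an irregularity that curated training corpora often normalize away.
CONTRACTIONS = {"It is": "It's", "do not": "don't", "cannot": "can't"}

def contract(text: str) -> str:
    """Miscellaneous: replace full forms with contractions."""
    for full, short in CONTRACTIONS.items():
        text = text.replace(full, short)
    return text

if __name__ == "__main__":
    rng = random.Random(0)
    s = "It is unclear whether the model can handle noisy input."
    print(char_swap(s, rng))  # e.g. "It is unclaer whether ..."
    print(word_drop(s, rng))  # e.g. "It is unclear whether model can ..."
    print(contract(s))        # "It's unclear whether ..."
```

In a study of this kind, each operator would be applied to the evaluation set of every task, and model performance on the perturbed inputs compared against the unperturbed baseline.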