Examining the Effects of Preprocessing on the Detection of Offensive Language in German Tweets

14 Oct 2021 · OpenReview Archive Direct Upload
Abstract: Preprocessing is essential for creating more effective features and reducing noise in classification, especially for user-generated data (e.g. Twitter). However, the effect of each individual preprocessing decision on a given classifier's behavior is not universal. We perform a series of ablation experiments examining how classifiers respond to individual preprocessing steps when detecting offensive language in German. While traditional classifiers vary little across preprocessing decisions, pre-trained BERT models are far more sensitive to each decision and do not behave identically to one another. We find that much of the variation between classifiers stems from how specific preprocessing steps alter the overall vocabulary distribution and, in the case of BERT models, how this interacts with WordPiece tokenization.
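The interaction between preprocessing and WordPiece tokenization can be illustrated with a minimal sketch. The greedy longest-match routine and the toy vocabulary below are illustrative assumptions (real BERT vocabularies hold tens of thousands of entries); the point is that a preprocessing step such as lowercasing can change how a word is split into subwords when only the cased form is in the vocabulary:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword tokenization, as in WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching piece: emit the unknown token
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (an assumption for illustration): the cased German word
# is a single vocabulary entry, but its lowercased form is not.
vocab = {"Schimpfwort", "schimpf", "##wort", "[UNK]"}

print(wordpiece("Schimpfwort", vocab))  # ['Schimpfwort']
print(wordpiece("schimpfwort", vocab))  # ['schimpf', '##wort']
```

Here the unprocessed word maps to one token, while the lowercased variant is fragmented into two subwords, shifting the token distribution the model sees.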