Benchmarking Language Models for Offensive Sentence Classification on the Offensive Nepali Roman Multi-Label Dataset
Abstract: This paper presents a comprehensive methodology for benchmarking and evaluating multiple language models to detect offensive language in Romanized Nepali text. Recognizing Nepali as a low-resource language, we introduce the Offensive Nepali Roman Multi-Label Dataset (ONRMD), labeled for abuse, scam, sexual, and neutral content and specifically designed for this study. We employ various models, including BERT-base-multilingual-cased, RoBERTa-base, distilbert-base-nepali, FastText, and LASER + CNN, and compare their performance on the ONRMD. Our approach encompasses thorough preprocessing and tokenization of the dataset, followed by training and evaluation using standard metrics such as accuracy, precision, recall, and F1 score. Additionally, given the novelty of our dataset and the absence of a standard baseline, we conduct human evaluations with two distinct groups to further validate our findings. The results demonstrate the potential of these models to handle the nuances of Romanized Nepali text for offensive language detection. This study serves as a foundation for future research involving other pre-trained language models and multilingual datasets.
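To make the evaluation protocol described in the abstract concrete, the sketch below shows one way the reported metrics (accuracy, precision, recall, F1) could be computed for a four-label setup matching the ONRMD labels (abuse, scam, sexual, neutral). The 0.5 decision threshold, the macro averaging, and the use of scikit-learn are illustrative assumptions on our part, not details specified in the submission.

```python
# Minimal sketch of multi-label evaluation for a 4-label setup (abuse, scam, sexual, neutral).
# Threshold and averaging strategy are assumptions for illustration, not the paper's reported settings.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

LABELS = ["abuse", "scam", "sexual", "neutral"]

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute accuracy, precision, recall, and F1 for multi-label predictions.

    y_true: binary indicator matrix of shape (n_samples, n_labels)
    y_prob: predicted probabilities of the same shape
    """
    y_pred = (y_prob >= threshold).astype(int)
    return {
        # accuracy_score on indicator matrices is exact-match (subset) accuracy
        "subset_accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }

if __name__ == "__main__":
    # Toy example: 3 samples, 4 labels ordered as in LABELS.
    y_true = np.array([[1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 1, 0]])
    y_prob = np.array([[0.9, 0.1, 0.2, 0.1], [0.2, 0.1, 0.1, 0.8], [0.7, 0.3, 0.6, 0.2]])
    print(evaluate(y_true, y_prob))
```

Macro averaging weights each of the four labels equally, which matters when classes such as scam are rare; micro averaging is an equally plausible choice if per-instance performance is the focus.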
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: datasets for low resource languages, benchmarking, multilingual corpora
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Data resources, Data analysis
Languages Studied: English, Romanized Nepali
Submission Number: 2880