Beyond Borders: Uncovering Dialectal Arabic Overlaps through Multi-Label Identification

ACL ARR 2025 February Submission 3582 Authors

15 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, Readers: Everyone, License: CC BY 4.0
Abstract: This paper explores Multi-Label Arabic Dialect Identification, addressing the limitations of single-label classification, which fails to capture the natural overlap between dialects. We use pseudo-labeling to generate multi-label training data and fine-tune BERT-based models to improve dialect classification. Our approach achieves state-of-the-art performance, surpassing previous methods by 7% in macro F1 score. These results show that allowing multiple dialect labels provides a more accurate representation of real-world language use. However, distinguishing similar dialects remains a challenge, emphasizing the need for better annotation techniques.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, linguistic variation, dialects and language varieties, less-resourced languages
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Arabic
Submission Number: 3582
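
The abstract names two technical ingredients, pseudo-labeling to obtain multi-label training data and macro F1 as the evaluation metric, without giving details on this page. The sketch below illustrates one common way these could work and is not the authors' procedure: the dialect names, the 0.5 threshold, the fallback to the top-scoring dialect, and the function names `pseudo_label` and `macro_f1` are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Illustrative label set only; the paper's actual dialect inventory is not given here.
DIALECTS = ["Egyptian", "Levantine", "Gulf"]


def pseudo_label(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Turn per-dialect confidence scores into a multi-label pseudo-label.

    Assumption: every dialect whose score reaches the threshold is kept, so a
    sentence that is plausible in several dialects receives several labels.
    If no score reaches the threshold, fall back to the single best dialect.
    """
    labels = {dialect for dialect, p in scores.items() if p >= threshold}
    if not labels:
        labels = {max(scores, key=scores.get)}
    return labels


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Macro F1 for multi-label data: per-label F1, averaged with equal weight."""
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # A label with no gold and no predicted instances contributes 0.0 here.
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


if __name__ == "__main__":
    # Hypothetical classifier scores for one sentence.
    print(pseudo_label({"Egyptian": 0.9, "Levantine": 0.6, "Gulf": 0.1}))
    gold = [{"Egyptian"}, {"Levantine", "Gulf"}]
    pred = [{"Egyptian"}, {"Levantine"}]
    print(round(macro_f1(gold, pred, DIALECTS), 3))
```

Because macro F1 weights every dialect equally, a dialect that is rarely predicted correctly pulls the score down as much as a frequent one, which is why confusions between similar dialects matter for this metric.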