Abstract: This paper introduces Swan, a family of cutting-edge embedding models specialized for Arabic language understanding. We present two models, Swan-Base and Swan-Large, which are further trained on a large-scale synthetic corpus. To comprehensively evaluate our models, we introduce ArabicMTEB, the largest Arabic text embedding evaluation benchmark to date, covering eight tasks across 74 diverse datasets. Additionally, we propose ArabicMTEBLite, a lightweight, domain-specific synthetic dataset designed for holistic evaluation. Our experiments reveal that Swan-Large exhibits remarkable text embedding capabilities, consistently outperforming all open-source models, including Multilingual-E5-large, across all tasks. Our efficient model, Swan-Base, likewise surpasses Multilingual-E5-base on all evaluated tasks. We also explore the impact of synthetic data and of the number of hard negatives on the performance of Swan-Base and Swan-Large. Our findings demonstrate that Swan-Base offers an optimal balance among performance, inference time, and cost. Our models will be made publicly accessible for research.
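Since the models are to be released publicly, a minimal sketch of how they might be used is given below, assuming the released checkpoints are compatible with the sentence-transformers API; the model identifier "UBC-NLP/Swan-Base" is a hypothetical placeholder, not a confirmed checkpoint name.

```python
# Minimal sketch: embedding Arabic sentences with a Swan checkpoint.
# Assumes sentence-transformers compatibility; the model id below is
# hypothetical and should be replaced with the released checkpoint name.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("UBC-NLP/Swan-Base")  # hypothetical model id

sentences = [
    "اللغة العربية غنية بالمفردات.",   # "Arabic is rich in vocabulary."
    "التعلم العميق يغير معالجة اللغة.",  # "Deep learning is changing NLP."
]

# Encode into dense vectors; L2-normalize so dot product equals cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)

# Pairwise cosine similarity between the two sentences.
print(util.cos_sim(embeddings[0], embeddings[1]))
```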
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingual extraction, passage retrieval, code-switching, mixed language, multilingualism, multilingual representations, multilingual benchmarks, multilingual evaluation, dialects and language varieties
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: Arabic, English, German, Spanish, Chinese, Vietnamese, Hindi
Submission Number: 3517