GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Matryoshka Representation Learning and Hybrid Loss Training

ACL ARR 2026 January Submission 6464 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Sentence Embeddings; Semantic Textual Similarity; Matryoshka Representation Learning; Hybrid Loss Training
Abstract: Semantic textual similarity (STS) underpins retrieval, clustering, and semantic understanding, yet remains underexplored for Arabic due to limited high-quality datasets and sentence embedding models. We introduce GATE, a General Arabic Text Embedding framework that achieves state-of-the-art performance on Arabic STS benchmarks within MTEB. GATE integrates Matryoshka Representation Learning with a novel hybrid training strategy that combines STS ranking and NLI classification using Arabic triplet data. This design yields embeddings that capture fine-grained Arabic semantics while remaining robust across multiple embedding dimensions. Despite using significantly fewer parameters, GATE outperforms larger multilingual and proprietary models by 20-25% on Arabic STS. Beyond accuracy, GATE enables efficient deployment through variable-dimension embeddings, offering strong performance retention at reduced dimensionalities suitable for resource-constrained settings. Our results demonstrate that combining semantic-rich supervision with multi-dimensional representations provides a practical and effective solution for Arabic sentence embeddings.
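As a rough illustration of the training recipe the abstract describes (Matryoshka Representation Learning wrapped around a hybrid of NLI triplet ranking and STS similarity objectives), the sketch below uses the sentence-transformers library. The base checkpoint, the specific loss choices (MultipleNegativesRankingLoss, CoSENTLoss), the Matryoshka dimensions, and the round-robin schedule are illustrative assumptions, not the authors' released configuration.

```python
# Hypothetical sketch (not the authors' code): Matryoshka Representation Learning
# combined with a hybrid of NLI triplet ranking and STS similarity objectives,
# built on the sentence-transformers library.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Base Arabic encoder; this checkpoint name is an illustrative assumption.
word_emb = models.Transformer("aubmindlab/bert-base-arabertv02", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_emb, pooling])

# NLI-style Arabic triplets (anchor, entailment, contradiction) drive a ranking loss.
nli_examples = [InputExample(texts=["الجو مشمس اليوم", "الطقس صافٍ", "السماء تمطر بغزارة"])]
nli_loader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.MultipleNegativesRankingLoss(model)

# STS pairs with graded similarity scores drive a similarity-ranking loss.
sts_examples = [InputExample(texts=["أحب القراءة", "القراءة هوايتي"], label=0.9)]
sts_loader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CoSENTLoss(model)

# Wrap both objectives in MatryoshkaLoss so the same embedding is supervised at
# several truncated dimensionalities, enabling variable-dimension inference.
dims = [768, 512, 256, 128, 64]
nli_mrl = losses.MatryoshkaLoss(model, nli_loss, matryoshka_dims=dims)
sts_mrl = losses.MatryoshkaLoss(model, sts_loss, matryoshka_dims=dims)

# Round-robin training over both objectives (one possible "hybrid" schedule).
model.fit(
    train_objectives=[(nli_loader, nli_mrl), (sts_loader, sts_mrl)],
    epochs=1,
    warmup_steps=100,
)
```

At inference time, embeddings trained this way can be truncated to any of the supervised dimensions (e.g., keeping only the first 64 values) with limited performance loss, which is what makes the variable-dimension deployment mentioned in the abstract practical in resource-constrained settings.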
Paper Type: Long
Research Area: Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other areas
Research Area Keywords: Semantics: Lexical and Sentence-Level, Machine Learning for NLP, Efficient/Low-Resource Methods for NLP, Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings
Languages Studied: Arabic
Submission Number: 6464