Abstract: Recent work has introduced several families of neural retrieval approaches that use transformer-based pre-trained language models to improve multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored and is often limited by data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models built on pre-trained Amharic BERT and RoBERTa architectures. Our proposed RoBERTa-Base-Amharic-Embed (a modest 110M parameters) outperforms the strongest multilingual model, Arctic Embed 2.0 (568M parameters), achieving a 5.01\% relative improvement in MRR@10 and a 3.34\% gain in Recall@10. Even more compact variants that we introduce, such as RoBERTa-Medium-Amharic-Embed (just 42M parameters), remain competitive despite being 14x smaller. We benchmark our proposed models against sparse and dense retrieval approaches to systematically evaluate retrieval performance in Amharic, revealing fundamental challenges in low-resource settings and underscoring the need for language-specific adaptation. Our work demonstrates the importance of optimizing retrieval models for morphologically complex languages and establishes a strong foundation for future research. To facilitate further advancements in low-resource information retrieval, we release our dataset, codebase, and trained models at https://github.com/amharic-ir-resources/Amharic-dense-retrival-models.
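As a quick illustration of the evaluation protocol referenced in the abstract, the sketch below ranks passages with a bi-encoder and computes MRR@10 and Recall@10 from the resulting ranking. The sentence-transformers usage and the local model path are illustrative assumptions rather than the authors' exact pipeline; substitute the checkpoints released in the linked repository.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical local path; replace with a checkpoint from the linked repository.
model = SentenceTransformer("path/to/RoBERTa-Base-Amharic-Embed")

queries = ["<Amharic query>"]
passages = ["<Amharic passage 1>", "<Amharic passage 2>"]

# Encode queries and passages into normalized dense vectors, then rank passages
# for each query by cosine similarity.
q_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
p_emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(q_emb, p_emb)                        # (num_queries, num_passages)
ranking = scores.argsort(dim=1, descending=True).tolist()  # passage indices, best first


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant passages retrieved within the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0


# Example: suppose passage index 1 is the only relevant passage for the first query.
print(mrr_at_k(ranking[0], {1}), recall_at_k(ranking[0], {1}))
```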
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Passage Retrieval, Dense Retrieval, Information Retrieval, Benchmarking, Retrieval Evaluation, Low-resource NLP, Contrastive Learning, Multilingual NLP, Fine-tuning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Amharic
Submission Number: 4376