Abstract: Recent work has introduced several families of neural retrieval approaches that use transformer-based pre-trained language models to improve multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored and is often limited by data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models built on pre-trained Amharic BERT and RoBERTa architectures. Our proposed RoBERTa-Base-Amharic-Embed (a modest 110M parameters) outperforms the strongest multilingual model, Arctic Embed 2.0 (568M parameters), achieving a 5.01\% relative improvement in MRR@10 and a 3.34\% gain in Recall@10. Even more compact variants that we introduce, such as RoBERTa-Medium-Amharic-Embed (just 42M parameters), remain competitive despite being 14x smaller. We benchmark our proposed models against sparse and dense retrieval approaches to systematically evaluate retrieval performance in Amharic, revealing fundamental challenges in low-resource settings and underscoring the need for language-specific adaptation. Our work demonstrates the importance of optimizing retrieval models for morphologically complex languages and establishes a strong foundation for future research. To facilitate further advancements in low-resource information retrieval, we release our dataset, codebase, and trained models at https://github.com/amharic-ir-resources/Amharic-dense-retrival-models.
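As a quick illustration of the evaluation protocol referenced in the abstract, the sketch below ranks passages with a bi-encoder and computes MRR@10 and Recall@10 from the resulting ranking. The sentence-transformers usage and the local model path are illustrative assumptions rather than the authors' exact pipeline; substitute the checkpoints released in the linked repository.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical local path; replace with a checkpoint from the linked repository.
model = SentenceTransformer("path/to/RoBERTa-Base-Amharic-Embed")

queries = ["<Amharic query>"]
passages = ["<Amharic passage 1>", "<Amharic passage 2>"]

# Encode queries and passages into normalized dense vectors, then rank passages
# for each query by cosine similarity.
q_emb = model.encode(queries, convert_to_tensor=True, normalize_embeddings=True)
p_emb = model.encode(passages, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(q_emb, p_emb)                        # (num_queries, num_passages)
ranking = scores.argsort(dim=1, descending=True).tolist()  # passage indices, best first


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant passages retrieved within the top k."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0


# Example: suppose passage index 1 is the only relevant passage for the first query.
print(mrr_at_k(ranking[0], {1}), recall_at_k(ranking[0], {1}))
```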
Paper Type: Long
Research Area: Information Retrieval and Text Mining
Research Area Keywords: Passage Retrieval, Dense Retrieval, Information Retrieval, Benchmarking, Retrieval Evaluation, Low-resource NLP, Contrastive Learning, Multilingual NLP, Fine-tuning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Reproduction study, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Amharic
Submission Number: 4376