From MTEB to MTOB: Retrieval-Augmented Classification for Descriptive Grammars

ACL ARR 2024 December Submission 2333 Authors

16 Dec 2024 (modified: 14 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Recent advances in language modeling have demonstrated significant improvements in zero-shot capabilities, including in-context learning, instruction following, and machine translation for extremely under-resourced languages (Tanzer et al., 2024). However, many languages with limited written resources are documented primarily through descriptions of their grammar and vocabulary. In this paper, we introduce a set of benchmarks to evaluate how well models can extract and classify information from the complex descriptions found in linguistic grammars. We present a Retrieval-Augmented Generation (RAG)-based approach that leverages these descriptions for downstream tasks such as machine translation. Our benchmarks encompass linguistic descriptions for 248 languages across 142 language families, focusing on typological features from WALS (Dryer and Haspelmath, 2013) and Grambank (Skirgård et al., 2023). This set of benchmarks offers the first comprehensive evaluation of language models' in-context ability to accurately interpret and extract linguistic features, providing a critical resource for scaling NLP to low-resource languages. The code and data are publicly available at https://anonymous.4open.science/r/from-MTEB-to-MTOB.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, language resources, datasets for low resource languages, evaluation methodologies, metrics, reproducibility
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Gã, Hausa, Ket, Hinuq, Emmi, Warrongo, Kutenai, Coast Tsimshian, Enxet Sur, Mosetén, Qaqet, Savosavo, Urarina, Nadëb
Submission Number: 2333