Benchmarking Public Large Language Models in Low-resource Languages

ACL ARR 2024 June Submission 2938 Authors

15 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: In recent years, Large Language Models (LLMs) have demonstrated impressive performance, particularly in zero-shot and few-shot learning across various languages. However, these models are typically evaluated on English or other high-resource languages, with limited attention to low-resource languages. This study benchmarks publicly available LLMs commonly used on Hugging Face, including XGLM, Falcon, Llama, mT5-base, BLOOM, Mistral, Pegasus-Xsum, and fine-tuned variants of mT5 and T5, on benchmark datasets in several low-resource languages. We evaluate these models on three natural language processing tasks: machine translation, text summarization, and question answering. Our results show significant variability in performance, highlighting both the strengths and limitations of current multilingual LLMs when applied to low-resource languages. In particular, we observe that the models tend to perform better on languages written in the Latin alphabet, the most widely used alphabetic script, than on those with non-Latin scripts, underscoring the need for more balanced training data.
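To make the evaluation setup concrete, the sketch below shows how one of the listed models could be benchmarked zero-shot on machine translation via the standard Hugging Face transformers API, scored with sacreBLEU. This is not the authors' evaluation code; the model choice (google/mt5-base), prompt wording, target language, and generation settings are illustrative assumptions.

```python
# Minimal sketch: zero-shot MT evaluation with a Hugging Face model.
# NOT the authors' pipeline; prompt format and settings are assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import sacrebleu

MODEL_NAME = "google/mt5-base"  # one of the benchmarked model families

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def translate(texts, max_new_tokens=128):
    """Greedy zero-shot translation; the prompt wording is a placeholder."""
    prompts = [f"Translate to Swahili: {t}" for t in texts]
    inputs = tokenizer(prompts, return_tensors="pt",
                       padding=True, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

sources = ["The weather is nice today."]
references = [["Hali ya hewa ni nzuri leo."]]  # one reference corpus

hypotheses = translate(sources)
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```

The same loop generalizes to the other two tasks by swapping the prompt and the metric (e.g., ROUGE for summarization, exact match/F1 for question answering).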
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Generation, Machine Translation, Multilingualism and Cross-Lingual NLP, Question Answering, Resources and Evaluation, Summarization
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Amharic, Hausa, Igbo, Nepali, Somali, Swahili, Tigrinya, Telugu, Xhosa, Yoruba, Zulu, Hindi, Indonesian, Afrikaans, Bengali, Tamil
Submission Number: 2938