Abstract: Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language processing tasks. However, their capabilities in linguistically diverse, low-resource contexts remain under-explored—particularly for languages that do not use Latin scripts. This study evaluates nine publicly accessible LLMs across 14 low-resource languages (LRLs), encompassing both Latin and non-Latin scripts (e.g., Ge’ez, Devanagari, Cyrillic), focusing on three key tasks: machine translation, text summarization, and question answering. Our analysis reveals significant performance disparities: languages with Latin scripts (e.g., Somali, Swahili, Yoruba) perform better than those with non-Latin scripts (e.g., Pashto, Nepali, Sinhala, Amharic), particularly in text summarization, where ROUGE scores differ by up to 39\% across languages. These disparities are strongly correlated with the type of tokenizer used: most of the tokenizers in this study are not effective for languages outside their primary training distribution or for languages with distinct linguistic features (e.g., non-Latin scripts, complex morphology). This highlights a critical need for language-specific tokenizers—or multilingual tokenizers explicitly designed to accommodate a broader range of linguistic characteristics—for optimal LLM performance on linguistically diverse LRLs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Low-resource Languages, Benchmarking Large Language Models, Performance Analysis, Language Scripts
Contribution Types: Model analysis & interpretability
Languages Studied: Somali, Swahili, Yoruba, Pashto, Kannada, Sinhala, Marathi, Punjabi, Tajik, Kyrgyz, Telugu, Amharic, Burmese, Nepali
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 4
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section A.3
B6 Statistics For Data: Yes
B6 Elaboration: Section A.4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section A.1
C2 Experimental Setup And Hyperparameters: N/A
C3 Descriptive Statistics: Yes
C3 Elaboration: Sections 5 and 6
C4 Parameters For Packages: Yes
C4 Elaboration: Section 4
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 1284