Abstract: Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language processing tasks. However, their capabilities in linguistically diverse, low-resource contexts remain under-explored—particularly for languages that do not use Latin scripts. This study evaluates nine publicly accessible LLMs across 14 low-resource languages (LRLs), encompassing both Latin and non-Latin scripts (e.g., Ge’ez, Devanagari, Cyrillic), focusing on three key tasks: machine translation, text summarization, and question answering. Our analysis reveals significant performance disparities: languages with Latin scripts (e.g., Somali, Swahili, Yoruba) perform better than those with non-Latin scripts (e.g., Pashto, Nepali, Sinhala, Amharic), particularly in text summarization, where ROUGE scores differ by up to 39\% across languages. These disparities are strongly correlated with the type of tokenizer used: most of the tokenizers in this study are not effective for languages outside their primary training distribution or for languages with distinct linguistic features (e.g., non-Latin scripts, complex morphology). This highlights a critical need for language-specific tokenizers—or multilingual tokenizers explicitly designed to accommodate a broader range of linguistic characteristics—for optimal LLM performance on linguistically diverse LRLs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Low-resource Languages, Benchmarking Large Language Models, Performance Analysis, Language Scripts
Contribution Types: Model analysis & interpretability
Languages Studied: Somali, Swahili, Yoruba, Pashto, Kannada, Sinhala, Marathi, Punjabi, Tajik, Kyrgyz, Telugu, Amharic, Burmese, Nepali
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 4
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section A.3
B6 Statistics For Data: Yes
B6 Elaboration: Section A.4
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section A.1
C2 Experimental Setup And Hyperparameters: N/A
C3 Descriptive Statistics: Yes
C3 Elaboration: Sections 5 and 6
C4 Parameters For Packages: Yes
C4 Elaboration: Section 4
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 1284