Abstract: Bengali is an underrepresented language in NLP research, and it remains challenging due to its unique linguistic structure and limited computational resources. In this work, we systematically investigate the challenges that hinder Bengali NLP performance, focusing on the absence of standardized evaluation benchmarks. We then evaluate 10 recent open-source Large Language Models (LLMs) on 8 translated datasets and perform a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families such as Mistral. We also identify promising robustness in certain architectures, such as DeepSeek, which maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy: models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization yields improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. We hope this work catalyzes further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Large Language Models, Bengali NLP, Low-resource languages, Multilingual evaluation, Benchmark datasets, Machine Translation, Tokenization
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Bengali
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: References
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Apache 2.0 License
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 5 (Conclusion).
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The sourced datasets are free of personal identifiers.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: GitHub repository.
B6 Statistics For Data: Yes
B6 Elaboration: Section 2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Translation-related parameters are discussed in Section 2.2.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 2.5 (Result Analysis).
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3 (Experiment Details).
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants to rephrase some sentences and to check grammar.
Author Submission Checklist: Yes
Submission Number: 445