Abstract: Bengali is an underrepresented language in NLP research, and it remains challenging due to its unique linguistic structure and limited computational resources. In this work, we systematically investigate the challenges that hinder Bengali NLP performance, focusing on the absence of standardized evaluation benchmarks. We then evaluate 10 recent open-source Large Language Models (LLMs) on 8 translated datasets and perform a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families such as Mistral. We also identify promising robustness in certain architectures, such as DeepSeek, which maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy: models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization yields improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. We hope this work catalyzes further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide.
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Large Language Models, Bengali NLP, Low-resource languages, Multilingual evaluation, Benchmark datasets, Machine Translation, Tokenization
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English, Bengali
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: References
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Apache 2.0 License
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 5 (Conclusion).
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The sourced datasets are free of personal identifiers.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: GitHub repository.
B6 Statistics For Data: Yes
B6 Elaboration: Section 2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 2
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Translation-related parameters are discussed in Section 2.2.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 2.5 (Result Analysis).
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3 (Experiment Details).
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants to rephrase some sentences and to check grammar.
Author Submission Checklist: Yes
Submission Number: 445