Abstract: Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric, instruction-tuned generative LLM designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continual pretraining with expanded transformer blocks, following the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text; we addressed it through rigorous data curation, augmentation, and strategic bilingual training that balances Hindi and English corpora to optimize cross-lingual knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, offering clear advantages over many existing models. We provide an in-depth discussion of our training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, and show how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM, LLaMA-3, Hindi, low-resource language modeling, fine-tuning, safety alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Hindi, English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Appendix H.3
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: No
B1 Elaboration: We created the artifact (a large language model) ourselves.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix G
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Appendix G
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Appendix G
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A, B
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix E, G
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Sections 3.2, 4.2
C3 Descriptive Statistics: N/A
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix F
D2 Recruitment And Payment: No
D2 Elaboration: We will provide these details in the camera-ready version of our paper.
D3 Data Consent: No
D3 Elaboration: We will provide these details in the camera-ready version of our paper.
D4 Ethics Review Board Approval: N/A
D4 Elaboration: We will provide these details in the camera-ready version of our paper.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Appendix F; additional details will be provided in the camera-ready version of our paper.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants only for debugging purposes. We will provide the details in the camera-ready version of our paper.
Author Submission Checklist: Yes
Submission Number: 1314