Abstract: Developing high-quality large language models (LLMs) for moderately resourced languages presents unique challenges in data availability, model adaptation, and evaluation. We introduce Llama-3-Nanda-10B-Chat, or Nanda for short, a state-of-the-art Hindi-centric, instruction-tuned generative LLM designed to push the boundaries of open-source Hindi language models. Built upon Llama-3-8B, Nanda incorporates continual pretraining with expanded transformer blocks, following the Llama Pro methodology. A key challenge was the limited availability of high-quality Hindi text; we addressed it through rigorous data curation, augmentation, and strategic bilingual training that balances Hindi and English corpora to optimize cross-lingual knowledge transfer. With 10 billion parameters, Nanda stands among the top-performing open-source Hindi and multilingual models of similar scale, offering clear advantages over many existing models. We provide an in-depth discussion of our training strategies, fine-tuning techniques, safety alignment, and evaluation metrics, and show how these approaches enabled Nanda to achieve state-of-the-art results. By open-sourcing Nanda, we aim to advance research in Hindi LLMs and support a wide range of real-world applications across academia, industry, and public services.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: LLM, LLaMA-3, Hindi, low-resource language modeling, fine-tuning, safety alignment
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: Hindi, English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Appendix H.3
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: No
B1 Elaboration: We created the artifact (a large language model) ourselves.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Appendix G
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Appendix G
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Appendix G
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A, B
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix E, G
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Sections 3.2, 4.2
C3 Descriptive Statistics: N/A
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix F
D2 Recruitment And Payment: No
D2 Elaboration: We will provide these details in the camera-ready version of our paper.
D3 Data Consent: No
D3 Elaboration: We will provide these details in the camera-ready version of our paper.
D4 Ethics Review Board Approval: N/A
D4 Elaboration: We will provide these details in the camera-ready version of our paper.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Appendix F; additional details will be provided in the camera-ready version of our paper.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants only for debugging purposes. We will provide the details in the camera-ready version of our paper.
Author Submission Checklist: Yes
Submission Number: 1314