Paramanu: A Family of Novel Efficient Generative Foundation Language Models for Indian Languages

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: generative language models, low-resource NLG, pretraining, multilingual, tokenization, instruction fine-tuning
TL;DR: We present Paramanu, a family of novel auto-regressive monolingual, bilingual, and multilingual language models for Indian languages, pretrained from scratch and currently covering 10 Indian languages across 5 scripts.
Abstract: We present PARAMANU (which means "atom" in multiple Indian languages), a family of novel language models for Indian languages. It is a collection of auto-regressive monolingual, bilingual, and multilingual Indian language models pretrained from scratch, currently covering 10 Indian languages (Assamese, Bangla, Hindi, Konkani, Maithili, Marathi, Odia, Sanskrit, Tamil, Telugu) across 5 scripts (Bangla, Devanagari, Odia, Tamil, Telugu). The models are pretrained with a context size of 1024 on a single GPU and range from 13.29M to 367.5M parameters. We propose a RoPE embedding scaling method that enables pretraining language models from scratch at a larger sequence-length context size on a single GPU without increased GPU memory. We also develop an efficient and novel tokenizer, combining BPE and Unigram, that has the lowest fertility score among existing LLMs for Indian languages and can also tokenize unseen languages written in the same script or in the Roman script. We further propose language-specific tokenization for multilingual models and domain-specific tokenization for monolingual models. To avoid the "curse of multilinguality" in our multilingual "mParamanu" model, we pretrain on comparable corpora grouped typologically by script. We also perform pretraining for more than one epoch for most of our language models. Our results show a language-transfer phenomenon from low-resource to high-resource languages within the same script and typology. We conduct human evaluation of our pretrained models for open-ended text generation on grammar, coherence, creativity, and factuality for several languages. Despite being 20 to 64 times smaller, our Paramanu models outperform standard and multilingual large language models (LLMs) by a large margin. We study the impact of language-specific versus language-agnostic tokenization for bilingual language modeling, as well as the impact of BPE versus Unigram tokenization for Devanagari-script languages. We further create instruction-tuning datasets and instruction-tune our pretrained models on 23,000 instructions in the respective languages, except Hindi, for which we use 75,000 instructions. Comparison with multilingual LLMs on various commonsense reasoning benchmarks for natural language understanding, natural language inference, and machine reading comprehension shows the advantage of our models. The performance of the Paramanu models leads to the conclusion that high-quality generative language models are possible without large amounts of compute (FLOPs) and enormous numbers of parameters.
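The abstract names two techniques without spelling them out: a RoPE embedding scaling method for longer pretraining contexts and a tokenizer evaluated by fertility score. The sketches below are rough illustrations only, not the authors' code: the first assumes a position-interpolation-style rescaling of rotary angles (the paper's exact scaling rule is not given in the abstract), and the second computes the standard tokens-per-word fertility metric. The function names `rope_angles` and `fertility`, and the `scale` parameter, are illustrative assumptions.

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0):
    # Standard rotary position embedding frequencies; `scale` compresses positions
    # so a longer context maps into the original angular range (position-interpolation
    # style scaling -- an assumption, not necessarily the paper's method).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() / scale
    angles = torch.outer(positions, inv_freq)   # shape: (seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)

def fertility(tokenize, corpus_lines):
    # Fertility = average number of subword tokens per whitespace-delimited word.
    # `tokenize` is any callable mapping a string to a list of tokens (assumed interface).
    n_tokens, n_words = 0, 0
    for line in corpus_lines:
        n_words += len(line.split())
        n_tokens += len(tokenize(line))
    return n_tokens / max(n_words, 1)   # lower is better: fewer pieces per word
```

For intuition, a fertility of 1.8 means a word is split into 1.8 subword pieces on average; lower values indicate that the tokenizer represents the language more compactly.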
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10447