Paramanu-Ganita: An Efficient Pre-trained Generative Mathematics Language Model with Chain-of-Thought Instruction Fine-Tuning

26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: reasoning, language models, pretraining, CoT fine-tuning, AI4Math
TL;DR: An efficient language model for mathematics, pretrained from scratch on a custom corpus of 31.5 billion tokens; despite having only 208 million parameters, it outperforms several large and very large LLMs on standard benchmarks.
Abstract: In this paper, we pose the following research questions: first, does domain-specific pretraining of tiny generative language models from scratch, with a domain-specialised tokenizer and Chain-of-Thought (CoT) instruction fine-tuning, yield highly competitive performance on mathematical reasoning compared to LLMs trained on trillions of tokens with enormous parameter counts? Second, is domain-specific pretraining from scratch environmentally sustainable and highly cost-efficient? To address these research questions, we present Paramanu-Ganita, a novel 208 million-parameter autoregressive (AR) decoder-based language model for mathematics. We pretrained it from scratch on 31.5 billion tokens with a context size of 4096 on a mixed mathematical corpus consisting of mathematical web pages, mathematics-related source code such as AlgebraStack, mathematical textbooks, CoT-templatised mathematical StackOverflow question-answer pairs, and mathematical lecture notes in LaTeX curated by us. We also trained a math- and code-specialised BPE tokenizer. We proposed and performed Chain-of-Thought instruction fine-tuning of Paramanu-Ganita on the MetaMathQA dataset. We evaluate our model on the GSM8K and MATH mathematical benchmarks, on logical deductive reasoning (LogiQA), and on multiple-choice high-school and college-level math questions from the SAT (AGIEVAL-SAT-Math), GRE/GMAT questions (AGIEVAL-AQuA-RAT), and college- and high-school-level math questions from MMLU. Despite being about 34 times smaller than 7B LLMs, Paramanu-Ganita outperforms general LLMs by approximately 30 percentage points, and even math-specialised LLMs by 3-23 percentage points, in GSM8K test accuracy. On the MATH benchmark, Paramanu-Ganita outperformed the various models by 6-8 percentage points. On other benchmarks, such as the LogiQA logical deductive reasoning benchmark, high-school-level multiple-choice math questions (MMLU-math-high-school), GRE/GMAT-level quantitative questions (AGIEVAL-AQuA-RAT), and SAT-level math questions, Paramanu-Ganita was better than the others by about 1-4 percentage points. This large and significant margin of improvement of our math model over existing LLMs signifies that the reasoning capabilities of language models are not restricted to those with a huge number of parameters. Paramanu-Ganita took only 170 hours of A100 training, whereas large LLMs such as the math-specialised LLEMMA 7B were trained for 23,000 A100-equivalent hours. Thus, our approach of pretraining powerful domain-specialised language models from scratch for domain adaptation is much more cost-effective and environmentally friendly than performing continual training of LLMs.
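To make the CoT instruction fine-tuning step concrete, the minimal Python sketch below shows how question/chain-of-thought-answer pairs in the style of MetaMathQA can be wrapped into single training strings for a causal decoder. The prompt template, field layout, and end-of-sequence marker are illustrative assumptions, not the exact format used for Paramanu-Ganita.

```python
# Illustrative sketch of CoT instruction fine-tuning data preparation.
# The template and the EOS marker below are assumptions for illustration;
# the paper's actual fine-tuning format is not reproduced here.

EOS = "</s>"  # assumed end-of-sequence token


def format_cot_example(question: str, cot_answer: str) -> str:
    """Wrap a MetaMathQA-style (question, chain-of-thought answer) pair
    into a single training string for causal-LM fine-tuning."""
    return (
        "### Instruction:\n"
        f"{question.strip()}\n\n"
        "### Response (reason step by step):\n"
        f"{cot_answer.strip()}{EOS}"
    )


if __name__ == "__main__":
    # Toy record in the style of a grade-school math CoT example.
    question = (
        "Natalia sold clips to 48 friends in April, and half as many in May. "
        "How many clips did she sell altogether?"
    )
    cot_answer = (
        "In May she sold 48 / 2 = 24 clips.\n"
        "In total she sold 48 + 24 = 72 clips.\n"
        "The answer is 72."
    )
    print(format_cot_example(question, cot_answer))
```

During fine-tuning, each such formatted string would be tokenized with the model's BPE tokenizer and used as a standard next-token prediction target, so the model learns to emit the step-by-step reasoning before the final answer.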
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5714