Aligning Context and Formulas: Multi-Stage Fine-Tuning for Large Language Models with A Novel Dataset ContextFormulas50K

ACL ARR 2025 February Submission 4502 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

Mathematics plays a crucial role in scientific research, yet formulating a problem in mathematical formulas is rarely easy, even for senior researchers. Fine-tuning pre-trained models on mathematical datasets to support such work has become a widely adopted approach. Although numerous mathematical datasets exist, only high-quality data can give a pre-trained model deep insight into formulas, because mathematics is a precise discipline. To address this, we propose the ContextFormulas50K dataset, which pairs mathematical formulas with their surrounding contextual text. Based on this dataset, we fine-tune two pre-trained models (i.e., LLaMA-8B-Instruct and Mistral-7B) to generate precise formulas for research writing. In this task, however, an information bottleneck emerges: the parameter scale grows much faster than performance. To overcome this problem, we introduce a Multi-Stage Fine-Tuning (MSFT) approach that equips large language models (LLMs) with better mathematical comprehension. Specifically, our method proceeds in three stages; by progressively inserting plug-and-play modules, performance improves at each stage, showing that the method effectively alleviates the information bottleneck. Experimental results show that our model improves mathematical comprehension and achieves state-of-the-art performance on the formula generation task, outperforming multiple commercial baselines (i.e., GPT-4, GPT-4o, and Claude-3.5-Sonnet).
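
The abstract does not include code, so the following is only a rough illustration of what "progressively inserting plug-and-play modules" across three fine-tuning stages could look like: each stage freezes all previously trained parameters and attaches a fresh bottleneck adapter before further training. The `Adapter` and `StagedModel` classes, the bottleneck width, and the three-stage loop are hypothetical stand-ins for illustration, not the authors' MSFT implementation.

```python
# Hypothetical sketch (not the authors' code): progressive "plug-and-play"
# adapters added over three fine-tuning stages, with earlier parameters frozen.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module inserted on top of a frozen layer (assumed design)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual insertion keeps the frozen representation intact.
        return h + self.up(torch.relu(self.down(h)))

class StagedModel(nn.Module):
    """A stand-in backbone block with a growing stack of stage-specific adapters."""
    def __init__(self, dim: int):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)   # placeholder for a frozen LLM block
        self.adapters = nn.ModuleList()

    def add_stage(self) -> None:
        # Freeze everything trained so far, then attach a fresh trainable adapter.
        for p in self.parameters():
            p.requires_grad = False
        self.adapters.append(Adapter(self.backbone.in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)
        for adapter in self.adapters:
            h = adapter(h)
        return h

model = StagedModel(dim=512)
for stage in range(3):                        # three stages, as in the abstract
    model.add_stage()
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
    # ... fine-tune on ContextFormulas50K context/formula pairs for this stage ...
```

Because only the newest adapter is trainable at each stage, the number of optimized parameters stays small even as modules accumulate, which is one plausible way the parameter scale can grow without the performance-versus-size gap described in the abstract.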

Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: mathematical NLP
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 4502