Abstract: Large language models (LLMs) have made significant advances in math problem solving, but their large size and high latency make them impractical for real-world intelligent mathematics solvers. Task-agnostic compact models have recently been developed to replace LLMs in general natural language processing tasks. However, these models often fail to acquire sufficient math-related knowledge from LLMs, which leads to unsatisfactory performance on math word problems (MWPs). To build a specialized compact model for representing MWPs, we adapt knowledge distillation (KD) to extract mathematical semantic knowledge from the large pre-trained model BERT, and we explore effective knowledge types and distillation strategies through extensive experiments. Our KD algorithm employs multi-knowledge distillation to extract fundamental knowledge from hidden states in the middle and lower layers, while also incorporating knowledge of mathematical relations and symbol constraints from higher-layer outputs and math decoder outputs by leveraging bottleneck networks. Pre-training tasks on MWP datasets, such as masked language modeling and part-of-speech tagging, are also used to improve the generalization of the compact model for MWP understanding. Additionally, a simple parameter mixing strategy is employed to prevent catastrophic forgetting of acquired knowledge. Our findings indicate that our approach can reduce the size of a BERT model by 10% while retaining approximately 95% of its performance on MWP datasets, outperforming mainstream BERT-based task-agnostic compact models. Ablation studies validate the efficacy of each component.
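As a rough illustration of two ingredients named above, the following minimal sketch shows one way a hidden-state distillation loss through a bottleneck projection and a parameter mixing step could look. The abstract does not specify either formulation, so everything here is an assumption: the linear projection, the mean-squared-error objective, the interpolation rule, and the names `project`, `hidden_state_loss`, `mix_parameters`, and `alpha` are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def project(hidden: Vector, weights: Matrix) -> Vector:
    """Assumed linear bottleneck: map a student hidden state to the teacher's width."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def hidden_state_loss(student: Vector, teacher: Vector, weights: Matrix) -> float:
    """Mean squared error between the projected student state and the teacher state.

    MSE is an assumption; the abstract does not name the distance function.
    """
    projected = project(student, weights)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher)) / len(teacher)


def mix_parameters(
    old: Dict[str, Vector], new: Dict[str, Vector], alpha: float
) -> Dict[str, Vector]:
    """Assumed parameter mixing: interpolate previously learned and newly trained weights.

    alpha = 1.0 keeps only the new weights; alpha = 0.0 keeps only the old ones.
    """
    return {
        name: [alpha * n + (1.0 - alpha) * o for o, n in zip(old[name], new[name])]
        for name in old
    }


if __name__ == "__main__":
    # Toy example: a 2-dimensional student state projected to a 3-dimensional teacher space.
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(hidden_state_loss([1.0, 2.0], [1.0, 2.0, 5.0], weights))
    print(mix_parameters({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.25))
```

In this sketch, a small `alpha` keeps the mixed weights close to the previously acquired ones, which is one plausible reading of how mixing could limit catastrophic forgetting; the paper's actual strategy may differ.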