MathChat: Benchmarking Mathematical Reasoning and Instruction Following in Multi-Turn Interactions

ACL ARR 2025 February Submission 905 Authors

11 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in mathematical problem-solving, particularly in single-turn question-answering formats. However, real-world scenarios often involve mathematical reasoning that requires multi-turn or interactive information exchanges, and the performance of LLMs on these tasks remains under-explored. This paper introduces MathChat, a comprehensive benchmark designed to evaluate LLMs across a broader spectrum of mathematical tasks. These tasks are structured to assess the models' abilities in multi-turn interactions and open-ended generation. We evaluate various state-of-the-art LLMs on the MathChat benchmark and observe that while these models excel in single-turn question answering, they significantly underperform in more complex scenarios that require sustained reasoning and dialogue understanding. To address these limitations, we develop MathChat_sync, a synthetic dialogue-based math dataset for LLM fine-tuning that focuses on improving models' interaction and instruction-following capabilities in conversations. Experimental results underscore the need to train LLMs on diverse, conversational instruction-tuning datasets such as MathChat_sync. We believe this work outlines one promising direction for improving the multi-turn mathematical reasoning abilities of LLMs, advancing the development of models that are more adept at interactive mathematical problem-solving and real-world applications.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Math Reasoning, Interactive Reasoning, Reasoning Evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 905