Unlocking Multilingual Math Potential: Strategies Amidst Data Scarcity

ACL ARR 2024 August Submission 316 Authors

16 Aug 2024 (modified: 10 Sept 2024) · ACL ARR 2024 August Submission · CC BY 4.0
Abstract: Existing works studying the behavior of Large Language Models (LLMs) in multilingual settings focus mainly on general downstream tasks such as instruction following. To fill this gap, we perform an in-depth analysis of LLMs' math reasoning capabilities in multilingual settings and propose to alleviate the shortage of high-quality multilingual math reasoning post-training data by exploring whether prior English math knowledge and additional English data help, and by observing the effects of multilingual synthetic data on performance. For models pre-trained mostly on English data, we find that prior English math knowledge helps and that scaling English data helps only when the training and evaluation data come from similar distributions (human- or machine-translated). Additionally, we find that including multilingual synthetic data improves performance on human-translated data but degrades performance on machine-translated data. Our findings shed light on effectively finetuning LLMs for better multilingual math reasoning given the shortage of high-quality multilingual math reasoning data.
Paper Type: Short
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: multilingualism, multilingual benchmarks
Contribution Types: Approaches to low-resource settings
Languages Studied: English, French, Spanish, German, Bengali, Chinese, Telugu, Swahili, Thai, Japanese, Russian
Submission Number: 316