Benchmarking Mathematical Reasoning in a Low-Resource Language: Structured Prompting and Evaluation in Basque
Abstract: Large language models (LLMs) have shown impressive performance on tasks requiring complex reasoning, but most evaluations focus exclusively on English. This work investigates how well LLMs perform mathematical reasoning in low-resource languages, using Basque as the primary case study. To support this analysis, we introduce \textbf{MASEU}, a benchmark designed to evaluate reasoning in Basque across arithmetic, algebraic, and logical tasks, and use it to assess both existing open models and newly trained systems. We address three key questions: how well LLMs support Basque in reasoning tasks, to what extent English in prompts can improve results, and the effect of continued pretraining in Basque. To explore these questions, we use a prompting strategy adapted for mathematical reasoning (\textit{DUP prompting}), which allows for more precise experimentation across zero-shot and few-shot settings and provides insights into how multilingual models handle reasoning tasks in underrepresented languages.
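For readers unfamiliar with DUP (Deeply Understanding the Problems) prompting, it decomposes each problem into three model calls: extracting the core question, extracting the problem-solving information relevant to it, and answering with both in hand. The Python sketch below illustrates that three-stage structure only; the chat helper and the exact prompt strings are placeholder assumptions, not the templates used in the paper.

# Minimal sketch of three-stage DUP prompting. The `chat` helper and the
# prompt wording are illustrative assumptions, not the paper's templates.

def chat(prompt: str) -> str:
    """Placeholder for one call to the evaluated LLM (API or local model)."""
    raise NotImplementedError

def dup_answer(problem: str) -> str:
    # Stage 1: ask the model to restate the core question of the problem.
    core = chat(f"{problem}\nPlease extract the core question, only the most "
                "comprehensive and detailed one.")
    # Stage 2: ask for the solving information relevant to that core question.
    info = chat(f"{problem}\nNote: please extract the problem-solving "
                f"information related to the core question: {core}")
    # Stage 3: solve step by step using the problem plus both extractions.
    return chat(f"{problem}\nHint: {info}\n{core}\nPlease understand the hint "
                "and question information, then solve the problem step by step.")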
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: logical reasoning, multilingual QA, few-shot QA, math QA, question generation
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Basque, English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Yes, see the "Limitations" section, which addresses the scope of language coverage, the risks of overgeneralization, potential translation bias, and the exclusion of proprietary models. These points align with A2's expectations regarding fairness, representation, and responsible interpretation of results.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Yes, see #2 (Related Work) and #3 (MASEU Dataset) sections. Both sections discuss prior datasets that form the basis for the creation of the MASEU benchmark, providing appropriate citations and context for how this new resource builds on existing work.
B2 Discuss The License For Artifacts: No
B2 Elaboration: No, the license is not discussed in the paper. However, the MASEU dataset is constructed using items adapted from ASDiv (licensed under CC BY-NC 4.0) and from MAWPS and SVAMP (both licensed under the MIT license). MASEU adheres to the terms of these original licenses and includes appropriate attribution. Additionally, we have contacted the original authors of all three datasets to inform them of our use and request explicit permission for redistribution where necessary. A public license for MASEU will be included upon its release to ensure clarity and compliance.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Yes, see #3 (MASEU Dataset) section. All source datasets (ASDiv, MAWPS, and SVAMP) were also released for research purposes, and our use of them is fully consistent with that intent. We have taken care to respect original licensing conditions. Attribution is provided, and further clarifications, including terms of use for MASEU, will be added upon release.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The datasets used to build MASEU (ASDiv, MAWPS, SVAMP) consist of synthetic math word problems and do not contain any personally identifying information or offensive content. Since no human subjects or user-generated content were involved, no additional anonymization or filtering was necessary.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Yes, see #3 (MASEU Dataset) section. It describes the dataset's domain (mathematical reasoning) and its languages (English and Basque). Additional information on its construction and source datasets is also included.
B6 Statistics For Data: Yes
B6 Elaboration: Yes, see the #3 (MASEU Dataset) section, which reports the relevant dataset statistics, including the number of examples and the dev/test splits.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: We report the number of parameters for all evaluated models in section #4 (Experiments), but we do not include details on the total computational budget or the specific computing infrastructure used. This is because we relied on publicly available, instruction-tuned models without further training, and evaluation was performed with moderate resources that were not tracked at a fine granularity.
C2 Experimental Setup And Hyperparameters: N/A
C2 Elaboration: We relied on publicly available, instruction-tuned models without further training.
C3 Descriptive Statistics: Yes
C3 Elaboration: Yes, see section #5 (Results) and Appendices A, B and D, where we report mean performance scores across multiple runs for both MASEU and MGSM datasets with different configurations.
C4 Parameters For Packages: No
C4 Elaboration: No third-party packages for preprocessing, normalization, or evaluation were used beyond standard Python libraries (e.g., json, re, collections). All processing and evaluation scripts were implemented from scratch.
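As a rough illustration of the kind of from-scratch evaluation code this answer describes (the function names and the normalization rule below are assumptions, not the actual scripts), final-answer scoring with the standard re module might look like:

import re

def extract_number(text: str) -> float | None:
    """Pull the last number out of a model response; a hypothetical
    normalization, not the paper's actual evaluation logic."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return float(matches[-1].replace(",", "")) if matches else None

def is_correct(response: str, gold: str) -> bool:
    pred = extract_number(response)
    return pred is not None and abs(pred - float(gold)) < 1e-6

# Example: is_correct("Beraz, erantzuna 42 da.", "42") returns True.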
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D1 Elaboration: The dataset was entirely created without the involvement of human annotators, crowdworkers, or external participants. As no human subjects were used in the data collection or annotation process, there were no instructions to report.
D2 Recruitment And Payment: N/A
D2 Elaboration: The dataset was entirely created without the involvement of human annotators, crowdworkers, or external participants. There was no recruitment.
D3 Data Consent: N/A
D3 Elaboration: The dataset was entirely created without the involvement of human annotators, crowdworkers, or external participants. No data consent was necessary.
D4 Ethics Review Board Approval: N/A
D4 Elaboration: As the work did not involve human subjects or third-party annotators, formal ethics review board approval was not required.
D5 Characteristics Of Annotators: N/A
D5 Elaboration: No annotators were involved, so there are no characteristics to report.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
E1 Elaboration: AI assistants were not used at any stage, in particular not in the creation of the dataset, where mathematical fidelity and linguistic naturalness were essential.
Author Submission Checklist: Yes
Submission Number: 167