Beyond Accuracy: Alignment and Error Detection across Languages in the Bi-GSM8K Math-Teaching Benchmark

ACL ARR 2025 July Submission 853 Authors

29 Jul 2025 (modified: 01 Sept 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: Recent advances in LLMs have significantly improved mathematical problem-solving, with models like GPT-4 achieving human-level performance. However, solving mathematical problems proficiently differs fundamentally from teaching mathematics effectively. To bridge this gap, we introduce the Bi-GSM8K benchmark, a bilingual English-Korean dataset enriched with teacher solutions, student solutions, and annotations marking students' initial errors. The dataset is designed to evaluate two core capabilities of LLMs: (1) measuring the similarity between student and teacher solutions, and (2) identifying the initial error point in student solutions. Our method achieves high agreement with human judgments, with Pearson 0.89 and Spearman 0.88 on English, and Pearson 0.89 and Spearman 0.87 on Korean. It also offers significantly lower latency and resource usage than commercial APIs, demonstrating strong computational efficiency. In the error detection task, open-source models achieved approximately 86% accuracy, within 10 percentage points of commercial LLM APIs, suggesting strong practical potential. Our key contributions include the open-source release of Bi-GSM8K, novel evaluation metrics, and comparative analyses of LLM performance across languages.
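For readers unfamiliar with the agreement figures quoted in the abstract, the following is a minimal sketch of how Pearson and Spearman correlations between an automatic similarity score and human ratings could be computed. It is not the authors' implementation: the function names (`pearson`, `average_ranks`, `spearman`) and the sample scores are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data, not taken from the paper.
human_scores = [1, 2, 3, 4, 5]                  # assumed human similarity ratings
metric_scores = [0.10, 0.35, 0.30, 0.70, 0.95]  # assumed automatic scores

print(f"Pearson:  {pearson(human_scores, metric_scores):.3f}")
print(f"Spearman: {spearman(human_scores, metric_scores):.3f}")
```

Pearson measures linear agreement between the raw scores, while Spearman only compares their orderings, so reporting both indicates whether a metric tracks human judgments in magnitude as well as in rank.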
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; language resources; multilingual corpora; automatic creation and evaluation of language resources; automatic evaluation of datasets; evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English, Korean
Previous URL: https://openreview.net/forum?id=gUJvf3e81Q
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: No
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Abstraction
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 1,2,3
B6 Statistics For Data: Yes
B6 Elaboration: 3, Appendix
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 5
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 5
C3 Descriptive Statistics: Yes
C3 Elaboration: 5, Appendix
C4 Parameters For Packages: Yes
C4 Elaboration: 4,5
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix
D2 Recruitment And Payment: N/A
D3 Data Consent: Yes
D3 Elaboration: 3
D4 Ethics Review Board Approval: Yes
D4 Elaboration: 3
D5 Characteristics Of Annotators: Yes
D5 Elaboration: 3
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: 3,4,5, appendix
Author Submission Checklist: yes
Submission Number: 853