Beyond Accuracy: Alignment and Error Detection across Languages in the Bi-GSM8K Math-Teaching Benchmark

ACL ARR 2025 July Submission853 Authors

29 Jul 2025 (modified: 01 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0

Abstract: Recent advances in LLMs have significantly improved mathematical problem-solving, with models like GPT-4 achieving human-level performance. However, solving mathematical problems proficiently differs fundamentally from teaching mathematics effectively. To bridge this gap, we introduce the Bi-GSM8K benchmark, a bilingual English-Korean dataset enriched with teacher solutions, student solutions, and annotations marking students' initial errors. The dataset is designed to evaluate two core capabilities of LLMs: (1) measuring the similarity between student and teacher solutions, and (2) identifying the initial error point in student solutions. Our method achieves high agreement with human judgments, with Pearson 0.89 and Spearman 0.88 on English, and Pearson 0.89 and Spearman 0.87 on Korean. It also offers significantly lower latency and resource usage than commercial APIs, demonstrating strong computational efficiency. In the error detection task, open-source models achieved approximately 86% accuracy, within 10 percentage points of commercial LLM APIs, suggesting strong practical potential. Our key contributions include the open-source release of Bi-GSM8K, novel evaluation metrics, and comparative analyses of LLM performance across languages.
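As an illustration of the agreement and accuracy figures quoted in the abstract, the following is a minimal sketch of how Pearson correlation, Spearman correlation, and first-error detection accuracy are conventionally computed. It is not the authors' evaluation code: the function names (`pearson`, `average_ranks`, `spearman`, `first_error_accuracy`) are hypothetical, and it assumes that each student solution has one model similarity score, one human similarity rating, and one annotated first-error step index.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def first_error_accuracy(predicted_steps, gold_steps):
    """Share of solutions whose predicted first-error step equals the annotation."""
    hits = sum(p == g for p, g in zip(predicted_steps, gold_steps))
    return hits / len(gold_steps)
```

With these helpers, `pearson(model_scores, human_scores)` and `spearman(model_scores, human_scores)` would be computed separately for the English and Korean subsets, and `first_error_accuracy` once per model on the error detection task.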

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: benchmarking; language resources; multilingual corpora; automatic creation and evaluation of language resources; automatic evaluation of datasets; evaluation

Contribution Types: NLP engineering experiment, Data resources, Data analysis

Languages Studied: English, Korean

Previous URL: https://openreview.net/forum?id=gUJvf3e81Q

Explanation Of Revisions PDF: pdf

Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).

Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)

A1 Limitations Section: This paper has a limitations section.

A2 Potential Risks: N/A

B Use Or Create Scientific Artifacts: No

B1 Cite Creators Of Artifacts: Yes

B1 Elaboration: Abstraction

B2 Discuss The License For Artifacts: N/A

B3 Artifact Use Consistent With Intended Use: N/A

B4 Data Contains Personally Identifying Info Or Offensive Content: N/A

B5 Documentation Of Artifacts: Yes

B5 Elaboration: 1,2,3

B6 Statistics For Data: Yes

B6 Elaboration: 3, Appendix

C Computational Experiments: Yes

C1 Model Size And Budget: Yes

C1 Elaboration: 5

C2 Experimental Setup And Hyperparameters: Yes

C2 Elaboration: 5

C3 Descriptive Statistics: Yes

C3 Elaboration: 5, Appendix

C4 Parameters For Packages: Yes

C4 Elaboration: 4,5

D Human Subjects Including Annotators: Yes

D1 Instructions Given To Participants: Yes

D1 Elaboration: Appendix

D2 Recruitment And Payment: N/A

D3 Data Consent: Yes

D3 Elaboration: 3

D4 Ethics Review Board Approval: Yes

D4 Elaboration: 3

D5 Characteristics Of Annotators: Yes

D5 Elaboration: 3

E Ai Assistants In Research Or Writing: Yes

E1 Information About Use Of Ai Assistants: Yes

E1 Elaboration: 3,4,5, appendix

Author Submission Checklist: yes

Submission Number: 853
