Beyond Accuracy: Alignment and Error Detection across Languages in the Bi-GSM8K Math-Teaching Benchmark

ACL ARR 2025 May Submission 4398 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Recent advances in Large Language Models (LLMs) have substantially improved mathematical problem solving, with models such as GPT-4 reaching human-level performance. However, solving mathematical problems proficiently is fundamentally different from teaching mathematics effectively. To bridge this gap, we introduce Bi-GSM8K, a bilingual English-Korean benchmark enriched with authentic teacher-written solutions, student-written solutions, and annotations marking each student's initial error. The dataset enables a comprehensive evaluation of both how closely LLM-generated solutions align with human educators' reasoning and how precisely LLMs detect students' initial errors. In our experiments, the exact-match alignment rate between student and teacher solutions was 74.4% for English and 75.0% for Korean. We further evaluated a range of commercial and open-source LLMs, finding that GPT-4o achieves the highest accuracy in initial error detection while open-source models retain an advantage in computational efficiency. Our key contributions are the open-source release of Bi-GSM8K, novel evaluation metrics, and comparative analyses of LLM performance across the two languages.
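To make the two headline quantities concrete, here is a minimal sketch of how an exact-match alignment rate and an initial-error detection accuracy could be computed. The record schema, field names (teacher_steps, student_steps, first_error_step, predicted_error_step), and the whitespace/case normalization are illustrative assumptions, not the paper's actual implementation or metric definitions.

```python
# Hypothetical sketch of the two evaluation quantities described in the abstract.
# All field names below are assumptions for illustration only.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    teacher_steps: list[str]                  # teacher-written solution steps
    student_steps: list[str]                  # student-written solution steps
    first_error_step: Optional[int]           # annotated index of student's first error; None if correct
    predicted_error_step: Optional[int]       # model's predicted first-error index; None if "no error"

def normalize(step: str) -> str:
    """Light normalization before comparison (whitespace and case only; an assumption)."""
    return " ".join(step.lower().split())

def exact_match_alignment(records: list[Record]) -> float:
    """Fraction of records whose student solution matches the teacher's, step for step."""
    hits = sum(
        len(r.teacher_steps) == len(r.student_steps)
        and all(normalize(t) == normalize(s)
                for t, s in zip(r.teacher_steps, r.student_steps))
        for r in records
    )
    return hits / len(records)

def error_detection_accuracy(records: list[Record]) -> float:
    """Fraction of records where the model pinpoints the student's annotated first error."""
    hits = sum(r.predicted_error_step == r.first_error_step for r in records)
    return hits / len(records)
```

Under this sketch, the abstract's 74.4% (English) and 75.0% (Korean) figures would correspond to exact_match_alignment over the respective language splits, and model comparisons for error detection would use error_detection_accuracy per model.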
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking; language resources; multilingual corpora; automatic creation and evaluation of language resources; automatic evaluation of datasets; evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English, Korean
Submission Number: 4398