Keywords: Large Language Models, Benchmarking, Evaluation, Korean Language, Dataset Construction
TL;DR: The paper presents the Thunder Korean Benchmark Suite and shows that reliable Korean LLM evaluation requires both language-aware benchmark construction and carefully designed scoring protocols.
Abstract: Reliable evaluation of foundation models in Korean requires benchmarks that measure intended capabilities rather than artifacts introduced by translation, localization, or evaluation protocol. In practice, Korean evaluation often adapts established English benchmarks, but literal translation can alter task difficulty, reduce prompt naturalness, or change what the task is intended to evaluate. We present the Thunder Korean Benchmark Suite, comprising Ko-ARC, Ko-GSM8K, Ko-EQ-Bench, Ko-WinoGrande, Ko-LAMBADA, and Ko-IFEval, which together cover six capabilities across 9,396 items. Rather than treating translation as a single preprocessing step, we construct each subset using one of three routes: expert-reviewed translation and localization, direct Korean construction, or a hybrid of localized adaptation and Korean-specific redesign. For multiple-choice subsets, we also report NPSQ-based accuracy to assess whether models rely on question evidence rather than superficial choice preference. Evaluation results show that model strengths differ across tasks and that larger models are not always the best performers. We further find that different scoring methods can lead to different interpretations depending on the task, highlighting the need to report benchmark scores together with their evaluation protocol.
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 85