Keywords: Multi-turn Dialogue, Chatbot Benchmark, Conversational Capability, Korean LLM Evaluation
Abstract: We introduce KoChatBench, a capability-based benchmark for evaluating Korean generative multi-turn dialogue. Existing evaluations often rely on single-turn or domain-specific tasks, which limits their ability to diagnose interaction-level failures. KoChatBench defines four core capabilities and comprises 600 sessions of 3 to 6 turns each. We evaluate six commercial LLMs using an LLM-as-a-judge framework with GPT-5-mini, adopting session-level minimum aggregation to capture critical failures. Results show that Gemma-4-31B-IT achieves the strongest overall performance, while Nemotron-3-Super-120B-A12B exhibits weaknesses in conversational robustness. These findings highlight the importance of capability-level analysis and provide a structured framework for assessing stability in multi-turn interactions.
Paper Type: Short
Research Area: Resources and Evaluation
Research Area Keywords: Dialogue and Interactive Systems, Resources and Evaluation, Multi-turn Dialogue Evaluation, Korean LLM Evaluation, Chatbot Benchmark, Conversational Capability Evaluation, LLM-as-a-Judge, Context Tracking, Instruction Following, Dialogue Robustness
Contribution Types: Data resources
Languages Studied: Korean
EMNLP 2026 AI Reviewing Experiment: no
Submission Number: 16500
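Illustration of session-level minimum aggregation: the abstract names this aggregation but does not give the scoring details, so the following is only a minimal sketch under stated assumptions. It assumes the judge assigns one numeric score per turn, that the session score is the minimum over those turns, and that sessions are then averaged. The 1 to 5 scale, the averaging step, and the names `session_score` and `benchmark_score` are hypothetical and not taken from the paper.

```python
from statistics import mean


def session_score(turn_scores):
    """Score a session by its weakest turn (assumed: one judge score per turn)."""
    if not turn_scores:
        raise ValueError("a session needs at least one scored turn")
    return min(turn_scores)


def benchmark_score(sessions):
    """Average the per-session minimum scores (the averaging step is an assumption)."""
    return mean(session_score(turns) for turns in sessions)


# Hypothetical judge scores on an assumed 1-5 scale.
sessions = [
    [5, 5, 4],     # stable session
    [5, 1, 5, 5],  # one critical failure in turn 2
]
per_session = [session_score(turns) for turns in sessions]
overall = benchmark_score(sessions)
```

Under these assumptions, the second session is scored by its single failed turn rather than by its otherwise high turn scores, which is the effect minimum aggregation is meant to have: one critical failure is not averaged away within a session.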