Automatic Evaluation of the Pedagogical Effectiveness of Open-Domain Chatbots in a Language-Learning Game
Track: Type D (Master/Bachelor Thesis Abstracts)
Keywords: chatbot, language model, classification, automatic evaluation, metric, language learning, conversation, deep learning, interaction, pedagogy
Abstract: This master’s thesis introduces SEED (Scoring Educational
Effectiveness of a Dialogue), a novel metric and approach to evaluating
the pedagogical value of responses generated by large language models
(LLMs) in a language-learning conversational game. As LLM-powered
chatbots are being deployed more widely as educational tools (Ji, Han,
& Ko, 2022; Kochmar et al., 2025), their design requires thorough evaluation to ensure not only their effectiveness, but also their reliability, given
their social impact and role as interactive agents (Xu, Chen, & Huang,
2022). This setting makes it difficult to apply traditional rule-based evaluation metrics effectively, and while recent advancements in neural-based
dialogue evaluation have improved assessment methods (Maurya et al.,
2025), there remains no clear consensus on automatic evaluation of dialogue data (Yeh, Eskenazi, & Mehri, 2021).
Therefore, assessing the quality of open-ended systems in educational
settings remains a non-trivial task. It spans linguistic proficiency as well as conversational, social, and pedagogical skills, along with broader goals such as effectiveness and user satisfaction (Ji, Han, & Ko, 2022). In dialogue designed to support second language development through open dialogue flows, overall dialogue metrics capture only part of the picture, as they do not take the learning experience into account. We need evaluation methods that reflect not only how language is performed but also how it is acquired, by accounting for pedagogical factors necessary for language acquisition, such as input exposure, output production, and negotiation of meaning (Loewen & Sato, 2018). Without a robust evaluation
framework, it remains unclear whether current systems successfully fulfill
these didactic functions. Moreover, in the context of evaluating a conversational language-learning game, chatbot assessment does not occur
in a traditional mentor–learner classroom setting, but rather within a
character–player learning experience.
To address this challenge, this thesis investigates the following research
question: how do we evaluate the pedagogical effectiveness of a dialogue whose learning goals are embedded rather than explicitly stated? Evaluating such a system adds an entirely new layer; it requires an approach that is sensitive to function and effect and that takes into account the open-ended, multi-dimensional nature of these conversations (Kochmar et al., 2025; Maurya et al., 2025; Tack & Piech, 2022).
The empirical foundation of this study is provided by Language Hero
(LH; Linguineo, 2024), a task-based conversational language-learning game that diverges from the traditional teacher–student setting. Within the
game, the chatbot functions as a narrative character, while the learner
assumes the role of a player. Language acquisition is thus facilitated
through open-ended dialogue embedded in the game’s storyline.
We proceeded in three stages: annotation framework design, metric development, and model implementation. First, we introduced a turn-level
annotation framework grounded in interactionist second language acquisition theory (Mackey, 2020). It focuses on three core dimensions:
communicative intent, learner output, and interactional support. These
dimensions were operationalized through a manually annotated corpus
of in-game dialogues from LH.
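To make the scheme concrete, the following is a minimal sketch (in Python) of one possible turn-level annotation record covering these three dimensions; the field names and value types are illustrative assumptions, not the thesis's actual codebook.

from dataclasses import dataclass

# Illustrative only: field names and value types are assumptions.
@dataclass
class TurnAnnotation:
    dialogue_id: str
    turn_index: int
    speaker: str                # "character" (chatbot) or "player" (learner)
    text: str
    communicative_intent: str   # label for the turn's communicative function
    output_elicitation: str     # whether and how learner output is invited
    interactional_support: str  # scaffolding offered, e.g. a recast or clarification request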
Subsequently, the annotations guided SEED, a metric that aggregates learning potential at the macro level and offers a lens through which to evaluate open-ended dialogue.
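The abstract does not specify SEED's exact aggregation formula; as a purely illustrative placeholder, the sketch below averages per-dimension turn scores into a single dialogue-level value.

from statistics import mean

def seed_score(turn_scores: list[dict[str, float]]) -> float:
    """Aggregate per-turn, per-dimension scores (assumed in [0, 1]) into one dialogue-level value."""
    # Average across the pedagogical dimensions within each turn, then across
    # turns; an unweighted mean stands in here for the actual SEED formula.
    return mean(mean(scores.values()) for scores in turn_scores)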
In the last stage, the evaluation was framed as a supervised classification task, using fine-tuned BERT models (Devlin et al., 2019) and their
variants (Liu et al., 2019; Pires, Schlinger, & Garrette, 2019; Sanh et al.,
2020) to develop the learnable metric. We implemented task-specific and
unified architectures using a hybrid input encoding strategy, either including or excluding conversational context. The results show that the
inclusion of context affects each pedagogical dimension differently. DistilBERT demonstrated the highest performance (F1 = 0.84) in predicting
communicative intent when conversational context was included in the
input. In contrast, output elicitation was most accurately predicted by
RoBERTa when context was excluded (F1 = 0.98). Overall, predicting
interactional support was most effective with BERT in a unified, context-aware architecture (F1 = 0.81), suggesting that shared representations
across pedagogical dimensions enhance model performance.
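As a rough illustration of the hybrid input encoding (not the thesis code), the sketch below classifies a single turn with or without preceding dialogue context using the Hugging Face transformers API; the checkpoint name, sequence length, and label count are assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint and label set size, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5
)

def classify_turn(turn: str, context: list[str] | None = None) -> int:
    """Predict a label for one turn, optionally prepending conversational context."""
    if context:
        # Context-aware variant: prior turns as the first segment, the target turn as the second.
        inputs = tokenizer(" ".join(context), turn, truncation=True,
                           max_length=256, return_tensors="pt")
    else:
        # Context-free variant: the target turn alone.
        inputs = tokenizer(turn, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))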
Future work will validate the SEED metric with end users to assess
effectiveness and reliability. SEED-informed scores could enable cross-dialogue evaluation and dynamic prompt adaptation, and could serve as a baseline for semi-supervised learning. We also plan to compare fine-tuned BERT
with few-shot LLM prompting in a hybrid model that combines their strengths (Wang & Chen, 2025).
Serve As Reviewer: ~Anaïs_Tack1
Submission Number: 52