Automatic Evaluation of the Pedagogical Effectiveness of Open-Domain Chatbots in a Language-Learning Game
Track: Type D (Master/Bachelor Thesis Abstracts)
Keywords: chatbot, language model, classification, automatic evaluation, metric, language learning, conversation, deep learning, interaction, pedagogy
Abstract: This master’s thesis introduces SEED (Scoring Educational
Effectiveness of a Dialogue), a novel metric and approach to evaluating
the pedagogical value of responses generated by large language models
(LLMs) in a language-learning conversational game. As LLM-powered
chatbots are being deployed more widely as educational tools (Ji, Han,
& Ko, 2022; Kochmar et al., 2025), their design requires thorough evaluation to ensure not only their effectiveness, but also their reliability, given
their social impact and role as interactive agents (Xu, Chen, & Huang,
2022). This setting makes it difficult to apply traditional rule-based evaluation metrics effectively, and while recent advancements in neural-based
dialogue evaluation have improved assessment methods (Maurya et al.,
2025), there remains no clear consensus on automatic evaluation of dialogue data (Yeh, Eskenazi, & Mehri, 2021).
Therefore, assessing the quality of open-ended systems in educational
settings remains a non-trivial task. It spans linguistic proficiency as well as conversational, social, and pedagogical skills, along with broader goals such as effectiveness and user satisfaction (Ji, Han, & Ko, 2022). In dialogue designed to support second language development through open dialogue flows, overall dialogue metrics capture only part of the picture, as they do not take the learning experience into account. We need evaluation methods that reflect not only how language is performed but also how it is acquired, by accounting for pedagogical factors necessary for language acquisition, such as input exposure, output production, and negotiation of meaning (Loewen & Sato, 2018). Without a robust evaluation
framework, it remains unclear whether current systems successfully fulfill
these didactic functions. Moreover, in the context of evaluating a conversational language-learning game, chatbot assessment does not occur
in a traditional mentor–learner classroom setting, but rather within a
character–player learning experience.
To address this challenge, this thesis investigates the following research
question: how do we evaluate the pedagogical effectiveness of a dialogue whose learning goals are embedded rather than explicitly stated? Evaluating such a system adds an entirely new layer; it requires an approach that is sensitive to function and effect and that takes into account the open-ended, multi-dimensional nature of these conversations (Kochmar et al., 2025; Maurya et al., 2025; Tack & Piech, 2022).
The empirical foundation of this study is provided by Language Hero
(LH; Linguineo, 2024), a task-based conversational language-learning game that diverges from the traditional teacher–student setting. Within the
game, the chatbot functions as a narrative character, while the learner
assumes the role of a player. Language acquisition is thus facilitated
through open-ended dialogue embedded in the game’s storyline.
We proceeded in three stages: annotation framework design, metric development, and model implementation. First, we introduced a turn-level
annotation framework grounded in interactionist second language acquisition theory (Mackey, 2020). It focuses on three core dimensions:
communicative intent, learner output, and interactional support. These
dimensions were operationalized through a manually annotated corpus
of in-game dialogues from LH.
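To make the scheme concrete, the following is a minimal sketch (in Python) of one possible turn-level annotation record covering these three dimensions; the field names and value types are illustrative assumptions, not the thesis's actual codebook.

from dataclasses import dataclass

# Illustrative only: field names and value types are assumptions.
@dataclass
class TurnAnnotation:
    dialogue_id: str
    turn_index: int
    speaker: str                # "character" (chatbot) or "player" (learner)
    text: str
    communicative_intent: str   # label for the turn's communicative function
    output_elicitation: str     # whether and how learner output is invited
    interactional_support: str  # scaffolding offered, e.g. a recast or clarification request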
Subsequently, the annotations guided SEED, a metric that aggregates learning potential at the macro level and offers a lens through which to evaluate open-ended dialogue.
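The abstract does not specify SEED's exact aggregation formula; as a purely illustrative placeholder, the sketch below averages per-dimension turn scores into a single dialogue-level value.

from statistics import mean

def seed_score(turn_scores: list[dict[str, float]]) -> float:
    """Aggregate per-turn, per-dimension scores (assumed in [0, 1]) into one dialogue-level value."""
    # Average across the pedagogical dimensions within each turn, then across
    # turns; an unweighted mean stands in here for the actual SEED formula.
    return mean(mean(scores.values()) for scores in turn_scores)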
In the last stage, the evaluation was framed as a supervised classification task, using fine-tuned BERT models (Devlin et al., 2019) and their
variants (Liu et al., 2019; Pires, Schlinger, & Garrette, 2019; Sanh et al.,
2020) to develop the learnable metric. We implemented task-specific and
unified architectures using a hybrid input encoding strategy, either including or excluding conversational context. The results show that the
inclusion of context affects each pedagogical dimension differently. DistilBERT demonstrated the highest performance (F1 = 0.84) in predicting
communicative intent when conversational context was included in the
input. In contrast, output elicitation was most accurately predicted by
RoBERTa when context was excluded (F1 = 0.98). Overall, predicting
interactional support was most effective with BERT in a unified, context-aware architecture (F1 = 0.81), suggesting that shared representations
across pedagogical dimensions enhance model performance.
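As a rough illustration of the hybrid input encoding (not the thesis code), the sketch below classifies a single turn with or without preceding dialogue context using the Hugging Face transformers API; the checkpoint name, sequence length, and label count are assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint and label set size, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5
)

def classify_turn(turn: str, context: list[str] | None = None) -> int:
    """Predict a label for one turn, optionally prepending conversational context."""
    if context:
        # Context-aware variant: prior turns as the first segment, the target turn as the second.
        inputs = tokenizer(" ".join(context), turn, truncation=True,
                           max_length=256, return_tensors="pt")
    else:
        # Context-free variant: the target turn alone.
        inputs = tokenizer(turn, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))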
Future work will validate the SEED metric with end users to assess
effectiveness and reliability. SEED-informed scores could enable cross-dialogue evaluation and dynamic prompt adaptation, and could serve as a baseline for semi-supervised learning. We also plan to compare fine-tuned BERT
with few-shot LLM prompting in a hybrid model that combines their strengths (Wang & Chen, 2025).
Serve As Reviewer: ~Anaïs_Tack1
Submission Number: 52