Rethinking NLP Evaluation for LLM-Based Assistants in Education Through a Human-Centered Evaluation Framework
Keywords: NLP Evaluation, Large Language Models (LLMs), Human-Centered AI Evaluation
Abstract: Large language models (LLMs) are increasingly adopted as educational assistants and feedback providers; however, dominant NLP evaluation practices continue to emphasize technical metrics such as accuracy, fluency, and automated judgment. We argue that these approaches are misaligned with educational contexts because they overlook human values, learner agency, and pedagogical goals. This paper presents an argument-driven critique of NLP evaluation practices as applied to educational LLMs and introduces a literature-informed, human-centered, and sociotechnical evaluation framework with guiding questions to support its use. We highlight criteria such as explainability, consistency, and refinement, which are critical to instructional effectiveness in human-AI interaction yet absent from existing NLP benchmarks, and argue for more human-aligned, co-creative, and long-term evaluation practices for educational LLM systems.
Paper Type: Short
Research Area: Human-AI Interaction/Cooperation and Human-Centric NLP
Research Area Keywords: NLP Evaluation, Large Language Models (LLMs), Human-Centered AI, Educational AI
Contribution Types: Position papers
Languages Studied: English
Submission Number: 8294