A Comparative Analysis of an LLM and a Specialized NLP System for Automated Assessment of Science Content
Abstract: Automated essay scoring (AES) systems are increasingly used to support writing instruction. While existing AES tools have shown promise, the emergence of large language models (LLMs) offers new opportunities to enhance them. If effective, general-purpose LLMs could also make automated assessment more accessible to educators who lack domain-specific tools. In this study, we compared Llama-3, an open-source LLM prompted with several strategies, against PyrEval, a specialized NLP-based assessment tool, in evaluating the science content of middle school students’ essays. PyrEval achieved higher overall performance in detecting key science concepts. An error analysis of where each model struggled revealed systematic differences in model behavior, exposing distinct strengths and limitations of each approach and informing directions for tool refinement. This work highlights the potential of LLMs to complement specialized systems and expand access to automated science assessment, while underscoring the need to improve their reliability.