Prompt vs. Supervise: A Case of Using Language Models to Assess Students' Science Explanations

Published: 2025, Last Modified: 08 Jan 2026 · EC-TE (1) 2025 · CC BY-SA 4.0
Abstract: General-purpose Language Models (LMs) open up new possibilities for automated assessment through simple textual prompts and little to no supervised data. However, there are significant tradeoffs between prompt-based approaches and predictive modeling (e.g., flexibility in model training vs. limitations in model evaluation). With a case study assessing middle school students' written science explanations, we empirically investigate the differences between prompting LMs and using annotated data to train supervised models. Our results across six scientific concepts show that finetuning LMs with supervised data leads to notable improvements in model performance. The best performing models across all content units were the fully supervised finetuned RoBERTa models, with AUCs above 0.90 in most cases. With access to only a few examples, the Llama 3 few-shot models with no supervision also perform consistently well, with AUCs above 0.77. However, it is unclear whether the flexibility of redefining the task without annotated data or model training outweighs the drop in performance. We conclude with a discussion of these design tradeoffs.
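To make the comparison concrete, here is a minimal sketch (not the authors' code, and using toy data) of the two setups the abstract contrasts: a few-shot prompt template for scoring a student explanation with a general-purpose LM, and AUC-based evaluation of the predicted probabilities a supervised classifier such as a finetuned RoBERTa model would produce.

```python
# Minimal sketch of the two assessment setups compared in the paper.
# The prompt template, example explanations, and scores below are illustrative
# assumptions, not material from the study.
from sklearn.metrics import roc_auc_score

# (1) Few-shot prompting: a rubric-style template with a handful of labeled examples.
FEW_SHOT_PROMPT = """You are grading middle school science explanations.
Label each explanation 1 if it correctly applies the target concept, else 0.

Explanation: "The ice melted because heat energy moved from the warm air into the ice."
Label: 1

Explanation: "The ice melted because it wanted to become water."
Label: 0

Explanation: "{explanation}"
Label:"""

def build_prompt(explanation: str) -> str:
    """Insert a student's explanation into the few-shot template before sending it to an LM."""
    return FEW_SHOT_PROMPT.format(explanation=explanation)

# (2) Supervised modeling: a finetuned classifier returns a probability per explanation,
# which is compared against annotated labels using AUC.
gold_labels = [1, 0, 1, 1, 0]                   # toy rubric annotations
model_scores = [0.92, 0.18, 0.74, 0.66, 0.31]   # toy predicted probabilities
print("AUC:", roc_auc_score(gold_labels, model_scores))
```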