Abstract: The application of language models to essay scoring has gained significant attention in recent years, typically by evaluating a single model across multiple prompts. In such a multi-prompt setup, however, it is crucial to account for the differing characteristics of each prompt: even a trait with the same name can vary notably from one prompt to another. This semantic variation in identically named traits underscores the need to treat them at a fine-grained, prompt-specific level. In this study, we propose a multi-level disentanglement framework for multi-prompt essay scoring, designed to achieve fine-grained disentanglement of the semantic differences across such traits. Our method not only improves scoring quality but also reduces memory usage and latency. Experimental results show that our framework surpasses seven state-of-the-art essay scoring methods as well as large language model (LLM)-based zero-shot and few-shot approaches, achieving the highest agreement with human essay ratings.