RedHat: Reducing Hallucination in Essay Critiques from Large Language Models

ACL ARR 2025 February Submission 6036 Authors

16 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Essay critiques are textual assessments of an essay: they serve as the basis for grading and are also crucial for improving the essay. Essay critique generation has received increasing attention with the rise of large language models (LLMs), which show promising potential in writing and critiquing essays. However, current LLMs suffer from hallucinations when generating essay critiques (e.g., baseless criticism), a problem that remains under-explored in the community. To facilitate research on reliable essay critique generation, we first define the task with a unified input-output format and clear evaluation criteria. To minimize hallucinations in critique generation, we introduce RedHat, a novel approach that embeds the key information of an essay directly into the generation process through document-level question answering, ensuring that critiques stay firmly anchored to the evaluated content. We collect EssayC, a large-scale, high-quality essay critique dataset from an undergraduate essay-writing course, in which human experts annotated multiple LLM-generated critiques. We experiment with RedHat using both commercial and open-source LLMs as backbones. Results show that critiques generated by RedHat are preferred over the baseline by an automatic judge and by human experts in around 20% of cases on EssayC in terms of ambiguity and informativeness, with a reduction of around 10% in hallucinations under our evaluation criteria.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Essay critiques, hallucination
Contribution Types: Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Chinese, English
Submission Number: 6036
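
The abstract describes RedHat's grounding mechanism only at a high level: key information is first extracted from the essay via document-level question answering, and the critique is then conditioned on that evidence. Below is a minimal sketch of that idea, assuming an OpenAI-style chat API; the model name, rubric questions, and prompt wording are illustrative placeholders, not the authors' actual implementation.

```python
# Minimal sketch of QA-grounded critique generation, as described in the
# abstract. All prompts, questions, and the model name are hypothetical
# placeholders, not the authors' released implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Single-turn chat completion; the backbone LLM is interchangeable."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def extract_key_info(essay: str, questions: list[str]) -> list[tuple[str, str]]:
    """Document-level QA: answer each question against the essay text,
    producing evidence that later critique claims can be anchored to."""
    qa_pairs = []
    for question in questions:
        answer = ask_llm(
            "Answer strictly based on the essay below. If the essay does not "
            "address the question, say 'not addressed'.\n\n"
            f"Essay:\n{essay}\n\nQuestion: {question}"
        )
        qa_pairs.append((question, answer))
    return qa_pairs


def generate_critique(essay: str, qa_pairs: list[tuple[str, str]]) -> str:
    """Condition the critique on the QA evidence so every criticism must be
    supported by extracted content rather than invented."""
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in qa_pairs)
    return ask_llm(
        "Write a critique of the essay below. Base every claim on the "
        "question-answer evidence; do not criticize anything the evidence "
        f"does not support.\n\nEssay:\n{essay}\n\nEvidence:\n{evidence}"
    )


if __name__ == "__main__":
    essay_text = open("essay.txt").read()  # hypothetical input file
    rubric = ["What is the essay's thesis?", "What evidence supports it?"]
    print(generate_critique(essay_text, extract_key_info(essay_text, rubric)))
```

Conditioning the critique on explicit QA evidence constrains the model's claims to content verifiably present in the essay, which is the hallucination-reduction mechanism the abstract names.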