Abstract: Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of large language models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexities of Chinese essays, particularly across diverse genres. To address this gap, we propose EssayBench, a fine-grained benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into Open-Ended and Constrained sets to capture diverse writing scenarios. To evaluate generated essays reliably, we develop a genre-specific, fine-grained scoring framework that aggregates scores in a dependency-aware manner. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 large-scale LLMs, analyzing their strengths and limitations across genres and instruction types. Our results reveal key challenges for LLMs in Chinese essay writing and point toward promising directions for future research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, automatic evaluation of datasets, evaluation methodologies, evaluation
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 2988