Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing

ACL ARR 2025 May Submission7838 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance on thousand-word-scale, open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW), which aims to resolve the implicit inconsistency that arises when an LLM-as-a-judge aggregates all sub-features in text evaluation. ToW adopts a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HowToBench, a large-scale Chinese writing benchmark encompassing $\mathbf{12}$ genres and $\mathbf{1302}$ instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates these biases, achieving a $\mathbf{0.93}$ Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, whereas ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the outline-guided task, showing that LLM writing cannot be improved simply by piling up input-side information.
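To make the idea of explicit aggregation weights concrete, the following is a minimal sketch (not the authors' implementation) of how a tree-structured evaluator might combine sub-feature scores. The node names, weights, and two-level rubric below are hypothetical and purely illustrative.

```python
# Hypothetical sketch of tree-structured score aggregation with explicit weights,
# in the spirit of what the abstract describes for Tree-of-Writing (ToW).
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CriterionNode:
    name: str
    weight: float                      # weight relative to sibling criteria
    score: Optional[float] = None      # leaf score, e.g. assigned by an LLM judge
    children: List["CriterionNode"] = field(default_factory=list)

    def aggregate(self) -> float:
        """Leaf nodes return their own score; internal nodes return the
        weight-normalized mean of their children's aggregated scores."""
        if not self.children:
            assert self.score is not None, f"leaf '{self.name}' has no score"
            return self.score
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.aggregate() for c in self.children) / total_weight


# Hypothetical two-level rubric: content quality vs. language quality.
root = CriterionNode("overall", 1.0, children=[
    CriterionNode("content", 0.6, children=[
        CriterionNode("relevance", 0.5, score=8.0),
        CriterionNode("coherence", 0.5, score=7.0),
    ]),
    CriterionNode("language", 0.4, children=[
        CriterionNode("fluency", 0.7, score=9.0),
        CriterionNode("style", 0.3, score=6.0),
    ]),
])

print(root.aggregate())  # 7.74: explicit weighted aggregation, not a single holistic judgment
```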
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Evaluation, large language models
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Chinese
Submission Number: 7838