Abstract: The rapid proliferation of large language models (LLMs) highlights an urgent need for evaluation frameworks that not only cover a wide range of writing tasks but also deliver reliable and nuanced evaluation results. However, current benchmarks are limited in scope, lacking both comprehensive coverage of specialized writing tasks and the granularity required to assess fine-grained requirements. Moreover, existing static evaluation methods fall short in capturing stylistic and contextual fidelity, particularly for diverse and complex writing tasks.
To tackle these challenges, we present WritingBench, a comprehensive benchmark comprising 1,239 queries spanning 6 domains and 100 subdomains with diverse material contexts, designed to evaluate multi-dimensional requirements such as style, format, and length.
We further propose a query-dependent evaluation framework that enables LLMs to dynamically generate task-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, ensuring fine-grained evaluation across a wide range of writing tasks. Leveraging the precise feedback from this evaluation process, we filter synthesized data to train a writing-enhanced model, which achieves a 21% improvement over baseline models in human evaluation.
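To make the query-dependent evaluation concrete, the sketch below illustrates one plausible two-stage flow: a judge LLM first derives assessment criteria tailored to the specific writing query, and a critic model then scores a candidate response against each generated criterion. This is a minimal illustration under assumptions, not the paper's released implementation; the prompt wording, the 1-10 scale, and the `llm`/`critic` callables are hypothetical stand-ins for whatever backend is used.

```python
"""Minimal sketch of query-dependent, criteria-aware scoring (hypothetical API)."""
import json
from typing import Callable, Dict, List

# Hypothetical prompt templates; the actual criteria and scoring prompts
# used by the benchmark may differ.
CRITERIA_PROMPT = (
    "You are an expert writing evaluator. Given the writing task below, list "
    "five task-specific assessment criteria (e.g., style, format, length) as a "
    "JSON array of objects with 'name' and 'description' fields.\n\nTask:\n{query}"
)

SCORING_PROMPT = (
    "Score the response from 1 to 10 on the criterion below and return JSON "
    'of the form {{"score": <int>, "justification": "<str>"}}.\n\n'
    "Criterion: {name} - {description}\n\nTask:\n{query}\n\nResponse:\n{response}"
)


def generate_criteria(llm: Callable[[str], str], query: str) -> List[Dict]:
    """Stage 1: ask the judge LLM for criteria specific to this query."""
    return json.loads(llm(CRITERIA_PROMPT.format(query=query)))


def score_response(
    critic: Callable[[str], str], query: str, response: str, criteria: List[Dict]
) -> Dict[str, int]:
    """Stage 2: have the critic model score the response once per criterion."""
    scores: Dict[str, int] = {}
    for c in criteria:
        result = json.loads(
            critic(
                SCORING_PROMPT.format(
                    name=c["name"],
                    description=c["description"],
                    query=query,
                    response=response,
                )
            )
        )
        scores[c["name"]] = result["score"]
    return scores
```

In such a setup, the per-criterion scores could be averaged into an overall rating or used directly as fine-grained feedback, e.g., for filtering synthesized training data as described above.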
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Writing, LLM, Evaluation
Languages Studied: English, Chinese
Submission Number: 8112