Abstract: The rapid proliferation of large language models (LLMs) highlights an urgent need for evaluation frameworks that not only cover a wide range of writing tasks but also deliver reliable and nuanced evaluation results. However, current benchmarks are limited in scope, lacking both comprehensive coverage of specialized writing tasks and the granularity required to assess fine-grained requirements. Moreover, existing static evaluation methods fall short in capturing stylistic and contextual fidelity, particularly for diverse and complex writing tasks.
To tackle these challenges, we present WritingBench, a comprehensive benchmark comprising 1,239 queries spanning 6 domains and 100 subdomains with diverse material contexts, designed to evaluate multi-dimensional requirements such as style, format, and length.
We further propose a query-dependent evaluation framework that enables LLMs to dynamically generate task-specific assessment criteria. This framework is complemented by a fine-tuned critic model for criteria-aware scoring, ensuring fine-grained evaluation across a wide range of writing tasks. Leveraging the precise feedback from this evaluation process, we filter synthesized data to train a writing-enhanced model, which achieves a 21% improvement over baseline models in human evaluation.
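To make the query-dependent evaluation concrete, the sketch below illustrates one plausible two-stage flow: a judge LLM first derives assessment criteria tailored to the specific writing query, and a critic model then scores a candidate response against each generated criterion. This is a minimal illustration under assumptions, not the paper's released implementation; the prompt wording, the 1-10 scale, and the `llm`/`critic` callables are hypothetical stand-ins for whatever backend is used.

```python
"""Minimal sketch of query-dependent, criteria-aware scoring (hypothetical API)."""
import json
from typing import Callable, Dict, List

# Hypothetical prompt templates; the actual criteria and scoring prompts
# used by the benchmark may differ.
CRITERIA_PROMPT = (
    "You are an expert writing evaluator. Given the writing task below, list "
    "five task-specific assessment criteria (e.g., style, format, length) as a "
    "JSON array of objects with 'name' and 'description' fields.\n\nTask:\n{query}"
)

SCORING_PROMPT = (
    "Score the response from 1 to 10 on the criterion below and return JSON "
    'of the form {{"score": <int>, "justification": "<str>"}}.\n\n'
    "Criterion: {name} - {description}\n\nTask:\n{query}\n\nResponse:\n{response}"
)


def generate_criteria(llm: Callable[[str], str], query: str) -> List[Dict]:
    """Stage 1: ask the judge LLM for criteria specific to this query."""
    return json.loads(llm(CRITERIA_PROMPT.format(query=query)))


def score_response(
    critic: Callable[[str], str], query: str, response: str, criteria: List[Dict]
) -> Dict[str, int]:
    """Stage 2: have the critic model score the response once per criterion."""
    scores: Dict[str, int] = {}
    for c in criteria:
        result = json.loads(
            critic(
                SCORING_PROMPT.format(
                    name=c["name"],
                    description=c["description"],
                    query=query,
                    response=response,
                )
            )
        )
        scores[c["name"]] = result["score"]
    return scores
```

In such a setup, the per-criterion scores could be averaged into an overall rating or used directly as fine-grained feedback, e.g., for filtering synthesized training data as described above.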
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Writing, LLM, Evaluation
Languages Studied: English, Chinese
Submission Number: 8112