Towards Human-Preference Chinese Rewriting Evaluation: Prompt-Based Scoring with Large Language Models
Keywords: Sentence Rewriting, Semantic Consistency, Large Language Models, Text Generation Control
Abstract: Sentence rewriting is a core task in natural language processing, encompassing paraphrasing, translation, and summarization. Despite its importance, existing evaluation metrics often rely on surface-level similarity measures (e.g., BLEU, ROUGE), which fail to capture deep semantic fidelity. In this work, we propose a principled, multi-dimensional framework for evaluating rewriting quality along four axes: semantic consistency, syntactic structure, lexical variation, and stylistic fidelity. We design a prompt-based scoring method built on the QwQ-32B language model, achieving a Spearman correlation of $\rho = 0.6121$ with human judgments, comparable to inter-human agreement ($\rho = 0.6076$). We further benchmark popular rewriting strategies with this metric and introduce a multi-round generation pipeline that improves rewriting quality by 9.66\%. Our results show that large language models, when paired with structured evaluation and guidance, can robustly assess and generate high-quality rewrites.
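No code accompanies this page; purely as an illustrative sketch of the prompt-based scoring and multi-round generation ideas described in the abstract, the pipeline could look like the following. Here `call_llm(prompt) -> str` is a hypothetical stand-in for any chat-completion client, and the prompt wording, the 1-5 rating scale, and the best-of-round selection rule are assumptions, not the paper's exact method.

```python
# Illustrative sketch only: prompt-based multi-dimensional rewrite scoring
# and a best-of-round generation loop. `call_llm(prompt) -> str` is a
# hypothetical stand-in for any chat-completion client; the prompt text,
# 1-5 scale, and selection rule are assumptions, not the paper's method.
from scipy.stats import spearmanr

# The four evaluation axes named in the abstract.
DIMENSIONS = [
    "semantic consistency",
    "syntactic structure",
    "lexical variation",
    "stylistic fidelity",
]

SCORING_PROMPT = (
    "Rate the rewrite against the source on {dim}, from 1 (poor) to "
    "5 (excellent). Reply with the number only.\n"
    "Source: {src}\nRewrite: {rewrite}\nScore:"
)

def score_rewrite(call_llm, src: str, rewrite: str) -> float:
    """Average the model's 1-5 ratings over the four dimensions."""
    ratings = [
        float(call_llm(SCORING_PROMPT.format(dim=d, src=src, rewrite=rewrite)).strip())
        for d in DIMENSIONS
    ]
    return sum(ratings) / len(ratings)

def multiround_rewrite(call_llm, src: str, rounds: int = 3) -> str:
    """Iteratively rewrite and keep the highest-scoring candidate."""
    best, best_score = src, score_rewrite(call_llm, src, src)
    candidate = src
    for _ in range(rounds):
        candidate = call_llm(
            f"Rewrite the sentence, preserving its meaning:\n{candidate}"
        )
        score = score_rewrite(call_llm, src, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

def spearman_vs_humans(model_scores, human_scores) -> float:
    """Spearman's rho between model and human judgments."""
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```

In this sketch, agreement with annotators is measured with scipy.stats.spearmanr, the same rank statistic ($\rho$) the abstract reports against human judgments.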
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6950