Towards Human-Preference Chinese Rewriting Evaluation: Prompt-Based Scoring with Large Language Models
Keywords: Sentence Rewriting, Semantic Consistency, Large Language Models, Text Generation Control
Abstract: Sentence rewriting is a core task in natural language processing, encompassing paraphrasing, translation, and summarization. Despite its importance, existing evaluation metrics often rely on surface-level similarity measures (e.g., BLEU, ROUGE), which fail to capture deep semantic fidelity. In this work, we propose a principled, multi-dimensional framework for evaluating rewriting quality along four axes: semantic consistency, syntactic structure, lexical variation, and stylistic fidelity. We design a prompt-based scoring method built on the QwQ-32B language model, achieving a Spearman correlation of $\rho = 0.6121$ with human judgments, comparable to inter-human agreement ($\rho = 0.6076$). We further benchmark popular rewriting strategies with this metric and introduce a multi-round generation pipeline that improves rewriting quality by 9.66\%. Our results show that large language models, when paired with structured evaluation and guidance, can robustly assess and generate high-quality rewrites.
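No code accompanies this page; purely as an illustrative sketch of the prompt-based scoring and multi-round generation ideas described in the abstract, the pipeline could look like the following. Here `call_llm(prompt) -> str` is a hypothetical stand-in for any chat-completion client, and the prompt wording, the 1-5 rating scale, and the best-of-round selection rule are assumptions, not the paper's exact method.

```python
# Illustrative sketch only: prompt-based multi-dimensional rewrite scoring
# and a best-of-round generation loop. `call_llm(prompt) -> str` is a
# hypothetical stand-in for any chat-completion client; the prompt text,
# 1-5 scale, and selection rule are assumptions, not the paper's method.
from scipy.stats import spearmanr

# The four evaluation axes named in the abstract.
DIMENSIONS = [
    "semantic consistency",
    "syntactic structure",
    "lexical variation",
    "stylistic fidelity",
]

SCORING_PROMPT = (
    "Rate the rewrite against the source on {dim}, from 1 (poor) to "
    "5 (excellent). Reply with the number only.\n"
    "Source: {src}\nRewrite: {rewrite}\nScore:"
)

def score_rewrite(call_llm, src: str, rewrite: str) -> float:
    """Average the model's 1-5 ratings over the four dimensions."""
    ratings = [
        float(call_llm(SCORING_PROMPT.format(dim=d, src=src, rewrite=rewrite)).strip())
        for d in DIMENSIONS
    ]
    return sum(ratings) / len(ratings)

def multiround_rewrite(call_llm, src: str, rounds: int = 3) -> str:
    """Iteratively rewrite and keep the highest-scoring candidate."""
    best, best_score = src, score_rewrite(call_llm, src, src)
    candidate = src
    for _ in range(rounds):
        candidate = call_llm(
            f"Rewrite the sentence, preserving its meaning:\n{candidate}"
        )
        score = score_rewrite(call_llm, src, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best

def spearman_vs_humans(model_scores, human_scores) -> float:
    """Spearman's rho between model and human judgments."""
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```

In this sketch, agreement with annotators is measured with scipy.stats.spearmanr, the same rank statistic ($\rho$) the abstract reports against human judgments.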
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6950