Keywords: LLM judges, writing style personalization
TL;DR: An investigation into using LLM judges to measure the writing-style personalization of LLM-generated text.
Abstract: We consider the problem of determining how well large language models (LLMs) can judge LLM-generated text when the generator is prompted to align with a specific writing style. This issue matters, for example, when a user's preferred writing style is known (e.g., "inspirational") and an LLM is used as a judge to evaluate whether generated text adheres to that preference. In this paper, we evaluate performance on two judge tasks: style detection and style quality pairwise ranking. We focus on how (1) the writing task, (2) the generator-judge relationship, and (3) general commonsense and reasoning ability impact judge performance. To this end, we collected human style detection and pairwise ranking labels on text generated by four models for three generation tasks (email, tweet, and summary writing), which we use to assess LLM judging performance. We find that judge quality correlates strongly with general LLM ability measured using MMLU (Pearson r=0.87), varies by writing task (performance is highest for email, by 28%), and is consistent across most judging strategies. We likewise find that LLM evaluators are more consistent and reliable when using AB comparisons rather than rubric-based scoring for style ranking. Finally, we find that for style detection, using the LLM with the strongest general capabilities is best; however, this does not hold for style quality pairwise ranking, where the strongest models rely on details humans are insensitive to when identifying the better response.
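As an illustration of the kind of correlation analysis described above (judge quality vs. general ability measured by MMLU), here is a minimal sketch assuming scipy is available; the model names and scores are hypothetical placeholders, not the paper's data.

```python
# Minimal sketch (not from the paper): correlating judge accuracy with MMLU scores.
from scipy.stats import pearsonr

# Hypothetical per-model style-detection judge accuracy (agreement with human labels).
judge_accuracy = {"model_a": 0.62, "model_b": 0.71, "model_c": 0.78, "model_d": 0.85}
# Hypothetical MMLU scores for the same models.
mmlu_score = {"model_a": 0.58, "model_b": 0.67, "model_c": 0.74, "model_d": 0.86}

models = sorted(judge_accuracy)
r, p_value = pearsonr([judge_accuracy[m] for m in models],
                      [mmlu_score[m] for m in models])
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```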
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20183