Keywords: LLM judges, writing style personalization
TL;DR: An investigation into using LLM judges to measure the writing-style personalization of LLM-generated text.
Abstract: We consider the problem of determining how well large language models (LLMs) can judge LLM-generated text when the generator is prompted to align with a specific writing style. This issue matters, for example, when a user's preferred writing style is known (e.g., "inspirational") and an LLM is used as a judge to evaluate whether generated text adheres to that preference. In this paper, we evaluate performance on two judge tasks: style detection and style quality pairwise ranking. We focus on how (1) the writing task, (2) the generator-judge relationship, and (3) general commonsense and reasoning ability impact judge performance. To this end, we collected human style detection and pairwise ranking labels on text generated by four models for three generation tasks (email, tweet, and summary writing), which we use to assess LLM judging performance. We find that judge quality correlates strongly with general LLM ability measured using MMLU (Pearson r=0.87), varies by writing task (performance is highest for email, by 28%), and is consistent across most judging strategies. We likewise find that LLM evaluators are more consistent and reliable when using AB comparisons rather than rubric-based scoring for style ranking. Finally, we find that for style detection, using the LLM with the strongest general capabilities is best; however, this does not hold for style quality pairwise ranking, where the strongest models rely on details humans are insensitive to when identifying the better response.
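As an illustration of the kind of correlation analysis described above (judge quality vs. general ability measured by MMLU), here is a minimal sketch assuming scipy is available; the model names and scores are hypothetical placeholders, not the paper's data.

```python
# Minimal sketch (not from the paper): correlating judge accuracy with MMLU scores.
from scipy.stats import pearsonr

# Hypothetical per-model style-detection judge accuracy (agreement with human labels).
judge_accuracy = {"model_a": 0.62, "model_b": 0.71, "model_c": 0.78, "model_d": 0.85}
# Hypothetical MMLU scores for the same models.
mmlu_score = {"model_a": 0.58, "model_b": 0.67, "model_c": 0.74, "model_d": 0.86}

models = sorted(judge_accuracy)
r, p_value = pearsonr([judge_accuracy[m] for m in models],
                      [mmlu_score[m] for m in models])
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```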
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20183