Keywords: authorship verification, long-form story generation, RL for language models
TL;DR: This paper fine-tunes a model to mimic the writing styles of classic authors using Group Relative Policy Optimization (GRPO).
Abstract: Evaluating and optimizing authorial style in long-form story generation is challenging because style judgments often rely on subjective human voting, and there is no stable automatic evaluation method. We propose a two-stage pipeline. First, we train a style-similarity judge by fine-tuning a sentence-transformer with authorship-verification (AV) supervision, and calibrate its similarity outputs into a bounded [0,1] reward. Second, we use this judge as the primary reward in Group Relative Policy Optimization (GRPO) to fine-tune an 8B story generator for style-conditioned writing, avoiding the paired accept/reject supervision required by Direct Preference Optimization (DPO). Across four target authors (Mark Twain, Jane Austen, Charles Dickens, Thomas Hardy), the GRPO-trained 8B model achieves higher style scores than open-weight baselines, with an average style score of 0.893 across authors. These results suggest that AV-calibrated reward modeling provides a practical mechanism for controllable long-form style transfer under moderate model size and training budget.
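The two stages described in the abstract can be illustrated with a minimal sketch. It assumes a logistic calibration of the judge's raw similarity (the abstract does not say which calibration is used; the `scale` and `midpoint` parameters are hypothetical) and uses the standard GRPO group-relative advantage, in which each sampled story's reward is normalized by the mean and standard deviation of its group. The similarity values are made up for illustration and are not results from the paper.

```python
import math
from statistics import mean, pstdev


def calibrate(similarity: float, scale: float = 10.0, midpoint: float = 0.5) -> float:
    """Map a raw similarity score to a reward in [0, 1].

    Assumption: a logistic squashing; scale and midpoint are hypothetical
    values, not parameters reported in the submission.
    """
    return 1.0 / (1.0 + math.exp(-scale * (similarity - midpoint)))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative judge similarities for four stories sampled from one prompt.
similarities = [0.31, 0.58, 0.74, 0.49]
rewards = [calibrate(s) for s in similarities]
advantages = group_relative_advantages(rewards)

for s, r, a in zip(similarities, rewards, advantages):
    print(f"similarity={s:.2f} reward={r:.3f} advantage={a:+.3f}")
```

Because advantages are computed relative to the other samples in the same group, no accept/reject pair needs to be labeled: the story with the highest calibrated reward simply receives the largest positive advantage.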
Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.
Primary Area Selection: Theoretical Analysis and Interpretation of ML Models for NLP
Secondary Area Selection: Lexical, Compositional and Discourse Semantics
Use Of Generative Artificial Intelligence Tools: Yes, for editing/proofreading the manuscript, Yes, for writing code
Data Collection From Human Subjects: No
Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.
Submission Number: 77