UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning
Keywords: Generative Reward Model, Reinforcement Learning, Creative Writing
Abstract: A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts.
While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression.
Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale.
To address this, we propose UniCreative, a unified reference-free reinforcement learning framework.
We first introduce AC-GenRM, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments.
Leveraging these signals, we propose ACPO, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms, without supervised fine-tuning or ground-truth references.
Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks.
Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: Generation
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese
Submission Number: 3182