Iterative Dual-Model Alignment for Story Evaluation

Bruce Qin, Dan Goldwasser

Published: 05 Jul 2026, Last Modified: 09 May 2026, ACL 2026, Everyone, CC BY 4.0

Abstract: Large language models (LLMs) can both evaluate and explain text quality; however, most existing evaluators operate as static classifiers and lack the ability to refine their reasoning through interaction. We propose an \textbf{Iterative Alpha--Beta Learning} framework that jointly trains two complementary 8B models: an Alpha ($\alpha$) classifier that assesses pairwise story engagement, and a Beta ($\beta$) generator that produces structured, rubric-guided comparative explanations. The two models co-evolve within a closed feedback loop: $\alpha$ provides probabilistic preference signals to guide $\beta$’s Direct Preference Optimization (DPO), while $\beta$’s improved explanations are reintegrated to retrain $\alpha$ via a KL-based contrastive objective. This dual optimization enables mutual learning: $\alpha$ gains interpretability and robustness from $\beta$’s textual rationales, while $\beta$ acquires stronger alignment and discriminative precision from $\alpha$’s confidence deltas. Experiments on human-annotated story-pair datasets (\textsc{HANNA}) show that the proposed system consistently outperforms strong single-model baselines in both accuracy and explanation quality across multiple iterative rounds.
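The $\alpha$-to-$\beta$ half of the feedback loop can be illustrated with a minimal sketch. The abstract does not specify how the confidence deltas are computed or how preference pairs are selected, so everything below is an assumption made for illustration: the function names, the `min_margin` threshold, and the rule "chosen = explanation with the largest delta, rejected = explanation with the smallest" are hypothetical. Only `dpo_loss` follows a known definition, namely the standard per-example DPO objective; its `dpo_temperature` argument is the usual DPO scaling coefficient, renamed here to avoid confusion with the $\beta$ model.

```python
import math
from typing import List, Optional, Tuple


def confidence_delta(p_base: float, p_with_expl: float, gold_is_a: bool) -> float:
    """Change in the classifier's probability for the gold story.

    Assumption: p_base and p_with_expl are P(story A is more engaging)
    from the alpha classifier, without and with a candidate explanation.
    """
    if gold_is_a:
        return p_with_expl - p_base
    # Gold is story B, so compare the probabilities assigned to B.
    return (1.0 - p_with_expl) - (1.0 - p_base)


def select_pair(
    explanations: List[str], deltas: List[float], min_margin: float = 0.05
) -> Optional[Tuple[str, str]]:
    """Pick (chosen, rejected) explanations by confidence delta.

    Hypothetical selection rule: skip the example when the best and worst
    candidates are too close to give a useful preference signal.
    """
    best = max(range(len(deltas)), key=deltas.__getitem__)
    worst = min(range(len(deltas)), key=deltas.__getitem__)
    if deltas[best] - deltas[worst] < min_margin:
        return None
    return explanations[best], explanations[worst]


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    dpo_temperature: float = 0.1,
) -> float:
    """Standard per-example DPO loss: -log sigmoid(scaled log-ratio margin)."""
    margin = dpo_temperature * (
        (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    )
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a full pipeline the sequence log-probabilities passed to `dpo_loss` would come from the $\beta$ generator and a frozen reference copy of it; plain floats are used here only to keep the sketch self-contained.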