Stability of Preference Alignment for Multi-Turn Control with LLM Policies

Published: 06 Oct 2025, Last Modified: 04 Nov 2025 · MTI-LLM @ NeurIPS 2025 Poster · CC BY-ND 4.0
Keywords: LLM alignment, multi-turn interaction, preference learning, embodied control
TL;DR: We systematically study preference alignment methods for LLM-based multi-turn control, finding that GRPO with behavior cloning regularization improves stability in both gridworld and shared-control racing tasks.
Abstract: Large language models (LLMs) are increasingly deployed in multi-turn control settings, such as interface navigation and robot manipulation, where stability over long horizons is critical. In this work, we present a study of preference alignment methods, including group-relative policy optimization (GRPO), direct preference optimization (DPO), contrastive preference optimization (CPO), and a GRPO variant with behavior cloning regularization, in two domains: a tokenized gridworld and a shared-control racing task that requires long-horizon planning and interaction. Rather than proposing a new algorithm, our goal is to analyze stability trade-offs and clarify when existing approaches succeed or fail. We show that (1) contrastive methods such as DPO and CPO risk policy degradation when valid negatives are unavailable, (2) such methods struggle to recover multi-modal behaviors from a pre-trained initialization, and (3) adding behavior cloning regularization to GRPO improves robustness in some multi-turn settings. Together, our findings provide practical guidance for applying alignment techniques to long-horizon interactive policies and highlight open challenges for stable, preference-aware LLM control.
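For illustration only (this sketch is not taken from the paper, and the notation is assumed): behavior-cloning-regularized GRPO is commonly written as the group-relative policy loss plus a negative log-likelihood term on demonstration data, weighted by a hypothetical coefficient \(\lambda_{\mathrm{BC}}\):

% Generic sketch: group-relative advantage over G rollouts, GRPO loss, and a BC regularizer.
\[
\begin{aligned}
\hat{A}_i &= \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}, \qquad i = 1,\dots,G \\
\mathcal{L}_{\mathrm{GRPO}}(\theta) &= -\,\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G} \hat{A}_i \,\log \pi_\theta(a_i \mid s)\right] \\
\mathcal{L}(\theta) &= \mathcal{L}_{\mathrm{GRPO}}(\theta) + \lambda_{\mathrm{BC}}\; \mathbb{E}_{(s,a)\sim \mathcal{D}_{\mathrm{demo}}}\!\left[-\log \pi_\theta(a \mid s)\right]
\end{aligned}
\]

The BC term anchors the policy to demonstration actions, which is one plausible mechanism for the stability gains the abstract attributes to the regularized GRPO variant.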
Submission Number: 27