Keywords: Reinforcement Learning, Agent Alignment
TL;DR: One-shot style alignment for RL agents via latent inference from a single trajectory and reward-guided finetuning, enabling controllable and generalizable behavior
Abstract: Reinforcement learning (RL) has achieved remarkable success in training agents with high-performing policies, and recent work has begun to address the critical challenge of aligning such policies with human preferences. While these efforts have shown promise, most approaches rely on large-scale data and do not generalize well to novel forms of preference. In this work, we formalize one-shot style alignment as an extension of the preference alignment paradigm. The goal is to enable RL agents to adapt to human-specified styles from a single example, thereby eliminating the reliance on large-scale datasets and the need for retraining. To achieve this, we propose a framework that infers an interpretable latent style vector through a learned discriminator and adapts a pretrained base policy using a style reward signal during online interaction. Our design enables controllable and data-efficient alignment with target styles while maintaining strong task performance, and further supports smooth interpolation across unseen style compositions. Experiments across diverse environments with varying style preferences demonstrate precise style alignment, strong generalization, and task competence.
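The abstract describes the mechanism only at a high level: a latent style vector is inferred from a single demonstration, and the pretrained policy is finetuned with a style reward from a learned discriminator added to the task reward. The sketch below is a minimal illustration of that reward-shaping idea; the names (StyleDiscriminator, infer_style_latent, the weighting coefficient beta) and the toy implementations are assumptions for illustration, not the paper's actual components.

    # Minimal sketch of reward-guided style finetuning. Assumes (hypothetically)
    # a discriminator D(s, a, z) scoring how well a transition matches a target
    # style latent z inferred from a single demonstration trajectory.
    import numpy as np

    class StyleDiscriminator:
        """Toy stand-in for a learned discriminator over (state, action, style)."""
        def __init__(self, state_dim, action_dim, style_dim, seed=0):
            rng = np.random.default_rng(seed)
            self.w = rng.normal(size=(state_dim + action_dim + style_dim,))

        def score(self, state, action, z):
            # Probability-like score in (0, 1); higher = more style-consistent.
            x = np.concatenate([state, action, z])
            return 1.0 / (1.0 + np.exp(-x @ self.w))

    def infer_style_latent(demo_states, demo_actions, style_dim):
        """Placeholder for one-shot latent inference from a single demonstration."""
        feats = np.concatenate([demo_states.mean(0), demo_actions.mean(0)])
        return feats[:style_dim]  # illustrative projection only

    def shaped_reward(task_reward, state, action, z, disc, beta=0.5):
        """Combine the environment's task reward with a discriminator style reward."""
        style_reward = np.log(disc.score(state, action, z) + 1e-8)
        return task_reward + beta * style_reward

    # Usage with dummy data
    state_dim, action_dim, style_dim = 4, 2, 3
    disc = StyleDiscriminator(state_dim, action_dim, style_dim)
    demo_s, demo_a = np.random.randn(10, state_dim), np.random.randn(10, action_dim)
    z = infer_style_latent(demo_s, demo_a, style_dim)
    print(shaped_reward(1.0, np.zeros(state_dim), np.zeros(action_dim), z, disc))

In this reading, beta trades off task performance against style fidelity, and varying or interpolating z at finetuning time would correspond to the style compositions mentioned in the abstract.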
Primary Area: reinforcement learning
Submission Number: 25633