Keywords: reinforcement learning, large language models, multi-turn conversations, user interactions, in-context learning, continual learning, self-distillation
TL;DR: We propose a principled and scalable approach for learning directly from raw user conversations, improving alignment, instruction following, and personalization without explicit supervision.
Abstract: Multi-turn user interactions are among the most abundant data produced by language models, yet we lack effective methods to learn from them. While typically discarded, these interactions often contain useful information: follow-up user messages may indicate that a response was incorrect, failed to follow an instruction, or did not align with the user's preferences. Importantly, language models are already able to make use of this information in context. After observing a user's follow-up, the same model is often able to revise its behavior. We leverage this ability to propose a principled and scalable method for learning directly from user interactions through self-distillation. We condition the model on the user's follow-up message and distill the resulting hindsight token distribution back into the current policy. We show that this approach enables personalization and continual adaptation without any explicit supervision, suggesting a scalable path for continual learning from the raw interaction data produced in rich deployment environments.
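As a rough illustration of the distillation step described in the abstract, the following minimal sketch computes a per-token divergence between a "hindsight" distribution (the model conditioned on the user's follow-up) and the current policy's distribution (the model conditioned on the original context only). The abstract does not specify the objective, so the choice of KL(hindsight || policy), the function names, and the plain-list logits format are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over the vocabulary at one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def hindsight_distillation_loss(
    policy_logits: Sequence[Sequence[float]],
    hindsight_logits: Sequence[Sequence[float]],
) -> float:
    """Mean per-token KL(hindsight || policy) over the response tokens.

    policy_logits[t]    : logits at response position t given the original context.
    hindsight_logits[t] : logits at the same position when the same model is also
                          shown the user's follow-up message (treated as a fixed
                          target; in training, no gradient would flow through it).
    Both names and the KL direction are hypothetical choices for this sketch.
    """
    if len(policy_logits) != len(hindsight_logits):
        raise ValueError("policy and hindsight must cover the same token positions")
    if not policy_logits:
        raise ValueError("at least one token position is required")

    total = 0.0
    for p_row, h_row in zip(policy_logits, hindsight_logits):
        log_p = log_softmax(p_row)
        log_h = log_softmax(h_row)
        # KL(h || p) = sum_v h(v) * (log h(v) - log p(v))
        total += sum(math.exp(lh) * (lh - lp) for lh, lp in zip(log_h, log_p))
    return total / len(policy_logits)
```

Under these assumptions, the loss is zero when the follow-up does not change the model's token distribution, and it grows as the hindsight distribution moves away from the current policy, so only responses that the follow-up would have changed contribute a training signal.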
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 46