IMBO: An Influence-based Memorize and Bregman Optimization Strategy for Continual Preference Learning
Keywords: continual learning, influence function, preference learning
Abstract: Preference learning is an effective approach to aligning Large Language Models (LLMs) with human preferences while making human-AI interaction more intuitive. In dynamic real-world scenarios with evolving tasks and domains, continually adapting to shifting user preferences offers significant advantages over static one-shot training. However, existing alignment frameworks such as Direct Preference Optimization (DPO) are not inherently suited to continual learning (CL) because their optimization objectives are static. This paper addresses the fundamental challenge of continual preference learning with limited memory: how can a historical memory buffer be constructed and used so that it supports stable knowledge retention while still enabling adaptive alignment with evolving human preferences? First, we propose the prompt and responses Influence Functions (pIF and rIF), which select preference data effectively and overcome a limitation of vanilla influence functions, namely that they apply only to loss functions that decompose into a sum over individual data points. Next, we introduce Bregman-Lagrange optimization, which prevents forgetting of past preferences while enabling adaptive alignment with evolving preference distributions. Experimental results show that our method surpasses strong continual learning baselines in both task-incremental and domain-incremental preference learning settings, under both model-based and human evaluation.
Submission Number: 2