TL;DR: We address non-stationary preference drift in LLM alignment with an exponential reweighting strategy.
Abstract: Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose **Non-Stationary Direct Preference Optimisation (NS-DPO)**, which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter into the loss function, which exponentially weights datapoints so that learning focuses proportionally on the more time-relevant ones. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is unknown, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. In scenarios where various levels of preference drift are introduced, using popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
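The following is a minimal sketch of the exponentially time-weighted DPO loss described in the abstract, not the authors' released implementation (see Link To Code below). The function name `ns_dpo_loss`, the precomputed log-probability arguments, the per-pair timestamps, and the default values of `beta` and `gamma` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def ns_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    timestamps: torch.Tensor,             # time index t of each preference pair, shape (batch,)
    current_time: int,                    # latest time step T
    beta: float = 0.1,                    # DPO inverse-temperature (assumed default)
    gamma: float = 0.95,                  # single discount parameter (assumed default)
) -> torch.Tensor:
    # Standard DPO implicit-reward margin between chosen and rejected responses.
    logits = (policy_chosen_logps - ref_chosen_logps) - (
        policy_rejected_logps - ref_rejected_logps
    )
    # Exponential weights gamma^(T - t): recent pairs count more, older pairs less.
    weights = gamma ** (current_time - timestamps.float())
    # Weighted negative log-sigmoid preference loss, normalized by the total weight.
    return (weights * -F.logsigmoid(beta * logits)).sum() / weights.sum()
```

Under this sketch, setting `gamma = 1` assigns equal weight to all datapoints and reduces to standard DPO, which is consistent with the abstract's claim that stationary-case performance is not sacrificed.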
Lay Summary: Large Language Models (LLMs), like those powering chatbots and virtual assistants, are trained to align with human preferences. However, these preferences can evolve over time. For example, the answer to the question "How good are large language models (LLMs) at solving math questions?" differs greatly between 2020 and 2025. Such shifts can confuse LLMs if the training data mixes recent and outdated information.
Our research introduces a method called Non-Stationary Direct Preference Optimization (NS-DPO). This approach assigns more weight to recent data during training, helping LLMs stay attuned to current human preferences. We provide both theoretical analysis and experimental evidence showing that NS-DPO maintains model performance even as preferences change over time.
As LLMs become increasingly integrated into daily life, ensuring they adapt to evolving human perspectives is crucial. Our work offers a step towards more reliable AI systems.
Link To Code: https://github.com/geronest/ns-dpo
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, fine-tuning, DPO, non-stationarity, preference drift, RLHF
Submission Number: 7825