Abstract: Aligning language models (LMs) with human preferences remains challenging, partly because popular approaches, such as reinforcement learning from human feedback and direct preference optimization (DPO), often assume that the training data is sufficiently representative of the environment in which the model will be deployed. However, real-world applications frequently involve distribution shifts, e.g., changes in end-user behavior or preferences after deployment, which pose a significant challenge to LM alignment approaches. In this paper, we propose an importance weighting method tailored for DPO, namely IW-DPO, to address distribution shifts in LM alignment. IW-DPO can be applied to joint distribution shifts in the prompts, responses, and preference labels without explicitly assuming the type of distribution shift. Our experimental results on various distribution shift scenarios demonstrate the usefulness of IW-DPO.
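For intuition, the core idea described in the abstract can be sketched as a per-example reweighting of the standard DPO loss. The snippet below is a minimal illustration only, not the authors' implementation: the function name, the signature, and the assumption that importance weights (approximating a target-to-training density ratio over prompts, responses, and labels) are supplied externally are all ours.

```python
import torch
import torch.nn.functional as F

def iw_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                importance_weights, beta=0.1):
    """Importance-weighted DPO loss (sketch, not the paper's code).

    Each per-example DPO loss is scaled by a weight assumed to
    approximate p_target / p_train for the (prompt, chosen, rejected)
    triple; how these weights are estimated is not specified here.
    """
    # Standard DPO logits: difference of policy-vs-reference log-ratios
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)

    # Per-example DPO loss, reweighted by the importance weights
    per_example_loss = -F.logsigmoid(logits)
    return (importance_weights * per_example_loss).mean()
```

With all weights set to one, this reduces to the ordinary DPO objective, which matches the abstract's framing of IW-DPO as a correction applied on top of DPO.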
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Huazheng_Wang1
Submission Number: 4220