Abstract: Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge: overly strict constraints lead to excessive refusals, while overly permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extensibility. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract the resulting behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and allows new preferences to be integrated without retraining. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
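For concreteness, below is a minimal sketch of the test-time merging described in the abstract, assuming task-arithmetic-style parameter deltas. The checkpoint file names, function names, and example weights are hypothetical placeholders, not the paper's released code.

```python
# Minimal sketch of preference-vector extraction and merging (task-arithmetic style).
# Assumes each checkpoint is a PyTorch state dict with identical parameter names.
import torch

def extract_preference_vector(base_state, tuned_state):
    """Preference vector = parameter delta between a preference-tuned model and the base."""
    return {name: tuned_state[name] - base_state[name] for name in base_state}

def merge_preference_vectors(base_state, vectors, weights):
    """Add user-weighted preference vectors back onto the base parameters."""
    merged = {name: param.clone() for name, param in base_state.items()}
    for vec, w in zip(vectors, weights):
        for name in merged:
            merged[name] += w * vec[name]
    return merged

# Hypothetical usage: the weights control the helpfulness/harmlessness trade-off at test time.
base = torch.load("base_model.pt")          # placeholder base checkpoint
helpful = torch.load("helpful_model.pt")    # fine-tuned on a helpfulness preference
harmless = torch.load("harmless_model.pt")  # fine-tuned on a harmlessness preference
vectors = [extract_preference_vector(base, helpful),
           extract_preference_vector(base, harmless)]
merged_state = merge_preference_vectors(base, vectors, weights=[0.6, 0.4])
```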
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Large Language Models, RLHF, Safety, Alignment, Task arithmetic
Contribution Types: Model analysis & interpretability
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: 8
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 5.1, B.1
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: B.1
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: B.1
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: B.1
B5 Documentation Of Artifacts: Yes
B5 Elaboration: B.1
B6 Statistics For Data: Yes
B6 Elaboration: B.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 5.1, B.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 5.1, B.2, B.6, B.7
C3 Descriptive Statistics: Yes
C3 Elaboration: 5, B.1
C4 Parameters For Packages: Yes
C4 Elaboration: B.1
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: B.5
D2 Recruitment And Payment: Yes
D2 Elaboration: B.5
D3 Data Consent: Yes
D3 Elaboration: B.5
D4 Ethics Review Board Approval: No
D4 Elaboration: No; we asked our department's ethics board, and they informed us that no application was required for this work.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: B.5
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: We used ChatGPT for proofreading during the writing process.
Author Submission Checklist: Yes
Submission Number: 1357