Abstract: Multi-objective test-time alignment aims to adapt large language models (LLMs) to diverse multi-dimensional user preferences during inference while keeping LLMs frozen.
Recently, GenARM (Xu et al., 2025) first trains Autoregressive Reward Models (ARMs) independently for each preference dimension, without awareness of one another, and then combines their outputs according to user-specific preference vectors during inference to achieve multi-objective test-time alignment. This design has two key limitations: the need for *multiple* ARMs increases the inference cost, and the *separate* training of the ARMs causes misalignment between the guided generation and the user preferences.
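(For intuition, the following is a minimal, hypothetical sketch of this kind of multi-ARM guided decoding, assuming each ARM exposes per-token reward logits over the vocabulary; the function name, the `beta` scale, and the interface are illustrative assumptions, not GenARM's exact formulation.)

```python
import torch

def multi_arm_guided_logits(base_logits, arm_logits_list, pref_weights, beta=1.0):
    """Hypothetical sketch: combine a frozen LLM's next-token logits with
    per-dimension ARM reward logits, weighted by the user preference vector.

    base_logits:     next-token logits of the frozen base LLM, shape (vocab_size,)
    arm_logits_list: per-token reward logits, one tensor per preference dimension
    pref_weights:    user preference vector, e.g. [0.7, 0.3]
    """
    guided = base_logits.clone()
    for w, reward_logits in zip(pref_weights, arm_logits_list):
        guided = guided + beta * w * reward_logits
    return guided
```

Note that every ARM must be evaluated at every decoding step, so the inference cost grows with the number of preference dimensions.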
To address these issues, we propose Preference-aware ARM (PARM), a *single* unified ARM trained across *all* preference dimensions.
PARM uses our proposed Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA), which employs a bilinear form to condition the ARM on preference vectors, enabling precise control over preference trade-offs during inference.
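As a rough illustration only (a sketch under our own assumptions, not the paper's implementation), the snippet below shows one way a preference-conditioned bilinear low-rank update could be attached to a frozen linear layer; the class name `PBLoRALinear`, the `pref_to_core` mapping, and the initialization choices are all hypothetical.

```python
import torch
import torch.nn as nn

class PBLoRALinear(nn.Module):
    """Hypothetical sketch of a preference-aware bilinear low-rank adapter.

    The frozen base weight W0 is adapted as W0 + A @ C(lam) @ B, where A and B
    are standard low-rank factors and C(lam) is an r x r matrix generated from
    the preference vector lam, so one adapter covers all preference trade-offs.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, num_prefs: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # keep the pretrained weight frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.zeros(d_out, rank))
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        # Maps the preference vector (e.g. [w_helpful, w_harmless]) to the
        # r x r inner matrix of the bilinear form (an assumed parameterization).
        self.pref_to_core = nn.Linear(num_prefs, rank * rank)
        self.rank = rank

    def forward(self, x: torch.Tensor, pref: torch.Tensor) -> torch.Tensor:
        core = self.pref_to_core(pref).view(self.rank, self.rank)
        delta = self.A @ core @ self.B          # preference-conditioned update
        return self.base(x) + x @ delta.T
```

At inference, the same adapter weights are reused and only the preference vector changes, which is how a single model can realize different trade-offs without retraining.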
Experiments demonstrate that PARM reduces inference costs and achieves better alignment with preference vectors compared with existing methods. Additionally, PARM enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training, making multi-objective alignment accessible with limited computing resources.
The code is available at https://github.com/Baijiong-Lin/PARM.
Lay Summary: Large language models (LLMs) are incredibly powerful but often need to be fine-tuned to align with specific user preferences, such as being helpful, harmless, or humorous. Current methods require fine-tuning the entire LLM or training multiple reward models for different preferences, which is computationally expensive and can lead to misalignment between the LLM's generations and the user preferences.
We developed PARM, a Preference-Aware Autoregressive Reward Model: a single unified model that dynamically adjusts the LLM's responses to user preferences during inference while keeping the LLM frozen. PARM employs a technique called Preference-Aware Bilinear Low-Rank Adaptation (PBLoRA) to condition the model on a preference vector, allowing it to balance different preferences during inference without retraining.
PARM significantly reduces the computational cost and improves the alignment between the LLM's generations and user preferences. It also enables weak-to-strong guidance, allowing a smaller PARM to guide a larger frozen LLM without expensive training. This makes multi-objective alignment accessible with limited computing resources and makes AI assistants more versatile and efficient.
Primary Area: Deep Learning->Large Language Models
Keywords: LLM Alignment, Test-time Alignment, Multi-objective Alignment
Submission Number: 4525