MPAW: Multi-Preference Alignment through Weak Model Collaboration for Efficient and Flexible LLM Decoding
Keywords: Multi-objective alignment, Weak-to-strong alignment, Decoding-time Optimization, Large Language Models
TL;DR: MPAW is an efficient and flexible framework that aligns LLMs with multiple user-defined preferences at decoding time by combining weak models, requiring no retraining and offering significant computational savings while maintaining high alignment quality.
Abstract: Aligning large language models (LLMs) with diverse and competing human preferences remains a critical challenge for safe and effective deployment. While recent work demonstrates that decoding-time alignment via weak preference models achieves strong performance with minimal compute, existing methods optimize for single objectives, severely limiting their adaptability to real-world scenarios requiring multifaceted trade-offs (e.g., safety vs. helpfulness). We propose Multi-Preference Alignment through Weak Model Collaboration (\texttt{MPAW}), a scalable framework that aggregates guidance from heterogeneous weak preference models (smaller LLMs aligned to distinct objectives) into a unified decoding strategy. By dynamically integrating signals from specialized proxies (e.g., safety classifiers, conciseness scorers), \texttt{MPAW} preserves the generalization capabilities of large base models while enabling zero-shot adaptation to arbitrary preference weightings. Empirical results demonstrate reliable alignment quality, nearly matching the performance of computationally expensive multi-objective RLHF fine-tuning. Our findings establish weak model collaboration as a principled pathway for efficient, flexible LLM alignment without retraining.
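For intuition, the following is a minimal Python/PyTorch sketch of one plausible way such decoding-time aggregation could work, using proxy-tuning-style logit arithmetic. It is not the paper's actual algorithm: the function name, the aligned/unaligned weak-model pairing per objective, and the example weights are illustrative assumptions.

    import torch

    def guided_next_token_logits(base_logits, weak_pairs, weights):
        """Combine base-model logits with guidance from weak preference models.

        base_logits: (vocab,) next-token logits from the large base model.
        weak_pairs:  list of (aligned_logits, unaligned_logits) tuples, one per
                     objective; each tensor is (vocab,) and comes from a small model.
        weights:     per-objective scalars chosen by the user at decode time.
        """
        guided = base_logits.clone()
        for (aligned, unaligned), w in zip(weak_pairs, weights):
            # Each weak pair contributes the logit shift its alignment induced,
            # scaled by the user-specified preference weight for that objective.
            guided += w * (aligned - unaligned)
        return guided

    # Toy usage with random logits standing in for real model outputs.
    vocab = 8
    base = torch.randn(vocab)
    safety = (torch.randn(vocab), torch.randn(vocab))   # hypothetical safety proxy
    concise = (torch.randn(vocab), torch.randn(vocab))  # hypothetical conciseness proxy
    logits = guided_next_token_logits(base, [safety, concise], weights=[0.7, 0.3])
    next_token = torch.distributions.Categorical(logits=logits).sample()

Because the weights enter only at decoding time, they can be changed per request, which is consistent with the abstract's claim of zero-shot adaptation to arbitrary preference weightings without retraining.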
Submission Number: 77