Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose an inference-time alignment method for language models that accommodates multiple user criteria simultaneously.
Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision-making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds for our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset, with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in GPT-4 win-tie rate for the helpfulness reward while adhering to the harmlessness threshold.
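
To make the satisficing idea concrete, below is a minimal, hypothetical best-of-n style sketch of the selection rule the abstract describes (maximize a primary reward subject to a threshold on a secondary one). This is an illustration only, not the paper's actual decoding procedure; `primary_reward`, `secondary_reward`, and `threshold` are placeholder names for, e.g., a helpfulness reward model, a harmlessness reward model, and a user-chosen acceptability level.

```python
# Illustrative sketch of a satisficing selection rule over candidate responses.
# NOTE: a hypothetical approximation of the satisficing principle, not SITAlign's
# actual inference-time algorithm; reward functions and threshold are placeholders.
from typing import Callable, Sequence


def satisficing_select(
    candidates: Sequence[str],
    primary_reward: Callable[[str], float],    # e.g. a helpfulness reward model
    secondary_reward: Callable[[str], float],  # e.g. a harmlessness reward model
    threshold: float,                          # acceptability threshold on the secondary criterion
) -> str:
    """Maximize the primary reward among candidates whose secondary reward
    meets the threshold; if none qualify, fall back to the most acceptable one."""
    feasible = [c for c in candidates if secondary_reward(c) >= threshold]
    if feasible:
        return max(feasible, key=primary_reward)
    # No candidate satisfies the constraint: return the least-violating one.
    return max(candidates, key=secondary_reward)
```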
Lay Summary: Making language models act the way humans want is tricky because "good" behavior involves many different aspects, like being helpful and safe, which can sometimes conflict. Other methods try to make language models perfect at everything simultaneously. Our research proposes "SITAlign," a framework that, like humans, prioritizes one main goal (e.g., helpfulness) while ensuring other aspects (e.g., safety) meet a "good enough" standard.
Primary Area: Deep Learning->Large Language Models
Keywords: alignment, language models, fine-tuning, controlled decoding, inference time
Submission Number: 11713