Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose an inference-time alignment method for language models that accommodates multiple user criteria simultaneously.
Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision-making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds for our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset, with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in GPT-4 win-tie rate for the helpfulness reward while adhering to the harmlessness threshold.
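
To make the satisficing idea concrete, below is a minimal, hypothetical best-of-n style sketch of the selection rule the abstract describes (maximize a primary reward subject to a threshold on a secondary one). This is an illustration only, not the paper's actual decoding procedure; `primary_reward`, `secondary_reward`, and `threshold` are placeholder names for, e.g., a helpfulness reward model, a harmlessness reward model, and a user-chosen acceptability level.

```python
# Illustrative sketch of a satisficing selection rule over candidate responses.
# NOTE: a hypothetical approximation of the satisficing principle, not SITAlign's
# actual inference-time algorithm; reward functions and threshold are placeholders.
from typing import Callable, Sequence


def satisficing_select(
    candidates: Sequence[str],
    primary_reward: Callable[[str], float],    # e.g. a helpfulness reward model
    secondary_reward: Callable[[str], float],  # e.g. a harmlessness reward model
    threshold: float,                          # acceptability threshold on the secondary criterion
) -> str:
    """Maximize the primary reward among candidates whose secondary reward
    meets the threshold; if none qualify, fall back to the most acceptable one."""
    feasible = [c for c in candidates if secondary_reward(c) >= threshold]
    if feasible:
        return max(feasible, key=primary_reward)
    # No candidate satisfies the constraint: return the least-violating one.
    return max(candidates, key=secondary_reward)
```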
Lay Summary: Making language models act the way humans want is tricky because "good" behavior involves many different aspects, like being helpful and safe, which can sometimes conflict. Other methods try to make language models perfect at everything simultaneously. Our research proposes "SITAlign," a framework that, like humans, prioritizes one main goal (e.g., helpfulness) while ensuring other aspects (e.g., safety) meet a "good enough" standard.
Primary Area: Deep Learning->Large Language Models
Keywords: alignment, language models, fine-tuning, controlled decoding, inference time
Submission Number: 11713