PRISM: Pareto-Responsive Iterative Sampling with DPO for Multi-objective Planning


ICLR 2026 Conference Submission18210 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: DPO, Multi-Objective Optimization, Planner

TL;DR: PRISM makes LLM planners accurate, efficient, and robust by aligning them with multiple objectives in a single training pass.

Abstract: Many planning-style applications of large language models are inherently multi-objective: beyond correctness, users care about efficiency and about avoiding irrelevant or unsafe actions. Yet most alignment pipelines optimize a single scalar reward, which hides trade-offs and offers little control when secondary objectives have uncertain or deployment-specific weights. We present PRISM, a Pareto-responsive framework built around Direct Preference Optimization (DPO). PRISM adds three components designed for offline convergence toward balanced solutions. First, it uses golden comparisons that isolate per-objective preferences. Second, it computes attention-style weights from deficiency diagnostics that combine loss and gradient information. Third, it applies Pareto-guided sampling that orients preference pairs by their cosine alignment with the current weight direction. The resulting loop performs common-descent updates on a vector of objective deficiencies and stops once it certifies first-order Pareto stationarity. It removes the need for online reinforcement learning, reward sweeps, or families of specialist models. On six benchmarks in question answering, coding, and mathematical reasoning, PRISM improves accuracy over strong baselines while also reducing latency and step count and driving off-domain actions to near zero. PRISM thus offers a principled, compute-efficient recipe for robust multi-objective alignment of LLM-based planners.
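To make the abstract's three mechanisms concrete, here is a minimal sketch of textbook versions of them, in pure Python with the standard library only. It is not PRISM's implementation. All function names are hypothetical, and so are the following choices: a deficiency score of loss times gradient norm, a softmax temperature, a per-objective score gap used to orient pairs, and a restriction to two objectives. The common-descent part uses the closed-form min-norm step of the multiple-gradient descent algorithm (MGDA). That step finds the point d of smallest norm in the convex hull of the objective gradients. Stepping along -d decreases every objective, and d close to zero certifies first-order Pareto stationarity.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Sequence[float]) -> float:
    return math.sqrt(dot(a, a))


def deficiency_weights(losses: Sequence[float], grad_norms: Sequence[float],
                       temperature: float = 1.0) -> Vector:
    """Attention-style (softmax) weights over objectives.

    Assumption (hypothetical): objective i's deficiency score is
    loss_i * grad_norm_i; the paper's actual diagnostic may differ.
    """
    scores = [l * g / temperature for l, g in zip(losses, grad_norms)]
    m = max(scores)  # subtract the max before exp for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def orient_pair(resp_a: str, resp_b: str, score_gap: Sequence[float],
                weights: Sequence[float]) -> Tuple[str, str]:
    """Return (chosen, rejected) by cosine alignment with the weight vector.

    Assumption (hypothetical): score_gap[i] = objective_i(a) - objective_i(b),
    where higher objective values are better.
    """
    denom = norm(score_gap) * norm(weights)
    cos = dot(score_gap, weights) / denom if denom > 0 else 0.0
    return (resp_a, resp_b) if cos >= 0 else (resp_b, resp_a)


def min_norm_two(g1: Sequence[float], g2: Sequence[float]) -> Tuple[float, Vector]:
    """MGDA min-norm point of the convex hull of two gradients.

    Minimizes ||alpha*g1 + (1-alpha)*g2||^2 over alpha in [0, 1]; the
    unconstrained optimum is (g2 - g1).g2 / ||g1 - g2||^2, then clipped.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        alpha = 0.5  # identical gradients: every convex combination is the same
    else:
        alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
        alpha = min(1.0, max(0.0, alpha))
    d = [alpha * a + (1 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, d


def common_descent_step(theta: Sequence[float], g1: Sequence[float],
                        g2: Sequence[float], lr: float = 0.1) -> Vector:
    """One update along -d, which does not increase either objective to first order."""
    _, d = min_norm_two(g1, g2)
    return [t - lr * di for t, di in zip(theta, d)]


def is_pareto_stationary(g1: Sequence[float], g2: Sequence[float],
                         tol: float = 1e-6) -> bool:
    """First-order Pareto stationarity: the min-norm point is (near) zero."""
    _, d = min_norm_two(g1, g2)
    return norm(d) <= tol


# Orthogonal gradients: alpha = 0.5, d = [0.5, 0.5], so -d lowers both objectives.
print(min_norm_two([1.0, 0.0], [0.0, 1.0]))
# Directly opposing gradients: d = [0, 0], so the point is Pareto stationary.
print(is_pareto_stationary([1.0, 0.0], [-1.0, 0.0]))
```

With two objectives the min-norm subproblem has the closed form shown. With more objectives it becomes a small quadratic program over the probability simplex, usually solved iteratively (for example with Frank-Wolfe).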

Supplementary Material: pdf

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 18210
