Keywords: Alignment, Large language models
TL;DR: Quantile-Guided Alignment (QA) is a framework for multi-dimensional quantile alignment that reduces catastrophic risks in language models by imposing constraints on reward quantiles across multiple performance dimensions.
Abstract: Large language models can generate rare but catastrophic outputs, such as harmful conversations or insecure code. Existing Reinforcement Learning from Human Feedback (RLHF) typically maximizes average reward, leaving high-risk tail events insufficiently controlled. We introduce Quantile-Guided Alignment (QA), a framework that lets users specify desired improvements at any quantile, individually or jointly across multiple reward dimensions, thereby shifting the output distribution toward safer, more desirable outcomes with finer control. The method extends standard RLHF via an augmented reward formulation that enforces quantile constraints. Experiments on conversation and code-generation tasks show that quantile alignment significantly improves quality at the targeted tails while maintaining overall performance. These results position QA as a principled route to risk-calibrated language models with tail-focused alignment.
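To make the abstract's "augmented reward formulation that enforces quantile constraints" concrete, below is a minimal, hypothetical sketch of one way such a reward shaping could look. It is not the paper's actual implementation; the function name, the quantile-target format, and the tail-penalty form are all assumptions introduced for illustration.

```python
# Illustrative sketch only: the paper's exact augmented-reward formulation is not
# reproduced here. The function name, target format, and penalty shape are assumptions.
import numpy as np

def quantile_augmented_reward(rewards, quantile_targets, penalty_weight=1.0):
    """Augment per-sample rewards with penalties that activate when a reward
    dimension's empirical quantile falls below a user-specified target.

    rewards:          array of shape (batch, num_dims), one reward per dimension.
    quantile_targets: dict {dim_index: (quantile_level, desired_value)}, e.g.
                      {0: (0.05, 0.0)} asks the 5th-percentile reward of
                      dimension 0 to be at least 0.0.
    Returns an array of shape (batch,) for the policy-gradient update to maximize.
    """
    base = rewards.mean(axis=1)                  # standard average-reward term
    penalty = np.zeros(rewards.shape[0])
    for dim, (q, target) in quantile_targets.items():
        q_hat = np.quantile(rewards[:, dim], q)  # empirical batch quantile
        if q_hat < target:
            # Penalize only samples in the violating tail, pushing that
            # dimension's quantile up toward the requested target.
            in_tail = rewards[:, dim] <= q_hat
            penalty += penalty_weight * in_tail * (target - rewards[:, dim])
    return base - penalty

# Example: ask the 5th percentile of the "safety" dimension (index 0) to reach 0.0.
batch_rewards = np.random.randn(256, 2)
shaped = quantile_augmented_reward(batch_rewards, {0: (0.05, 0.0)})
```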
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 18711