Humanline: Online Alignment as Perceptual Loss

Published: 26 Jan 2026, Last Modified: 12 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: alignment, LLM, LLM alignment, prospect theory, perceptual loss, behavioral economics
TL;DR: Online alignment objectives (e.g., GRPO) mimic how humans perceive probability. By modifying offline objectives to do the same, we can match the performance of online alignment with offline off-policy data, giving us the best of both worlds.
Abstract: Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO)---but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and that PPO/GRPO-style clipping---originally introduced merely to stabilize training---recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO already act as perceptual losses. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility: we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating $\textit{humanline variants}$ of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts on both verifiable and unverifiable tasks.
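To make the abstract's analogy concrete, here is a minimal sketch (not the paper's actual objective) comparing the Tversky-Kahneman probability weighting function from prospect theory with PPO/GRPO-style ratio clipping. The parameter values (gamma = 0.61, the clip range eps = 0.2) are standard defaults from the respective literatures, not taken from this paper; the point is only that both transforms compress extreme values, which is the structural similarity the abstract alludes to.

```python
import math

def tk_weight(p: float, gamma: float = 0.61) -> float:
    """Tversky-Kahneman probability weighting function.
    Overweights small probabilities and underweights large ones;
    gamma = 0.61 is their fitted value for gains."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def ppo_clip(r: float, eps: float = 0.2) -> float:
    """PPO/GRPO-style clipping of the importance ratio r = pi_new / pi_old.
    The objective's gradient vanishes once r leaves [1 - eps, 1 + eps]."""
    return min(max(r, 1 - eps), 1 + eps)

if __name__ == "__main__":
    # Both transforms pull extreme inputs toward the middle of their range.
    for p in (0.01, 0.50, 0.99):
        print(f"w({p}) = {tk_weight(p):.3f}")
    for r in (0.5, 1.0, 1.5):
        print(f"clip({r}) = {ppo_clip(r):.2f}")
```

Under this illustration, small probabilities are inflated (w(0.01) ≈ 0.055) and large ones deflated (w(0.99) ≈ 0.91), just as clipping caps how far the importance ratio can move the update.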
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5530