Keywords: Loss Landscapes, Hessian Analysis, Optimization Dynamics, Chain-of-Thought, CoT, Value Conflicts, Scaling
TL;DR: We show geometrically that baseline value-conflict resolution in LLMs produces large top Hessian eigenvalues, indicating a sharp and unstable loss landscape, and we propose an annealing-inspired CoT that resolves value complexity in a smoother, more stable way.
Abstract: Current alignment paradigms, such as Reinforcement Learning from Human Feedback (RLHF), often collapse complex human values into scalar rewards, even though those values frequently conflict. We show that when models resolve value conflicts, their loss landscape becomes unstable, as indicated by a high top Hessian eigenvalue and a “cliff-like” landscape. We demonstrate that chain-of-thought (CoT) reasoning lowers this top eigenvalue and smooths the loss landscape. We further introduce an annealing-inspired CoT that enforces a transition from high-temperature exploration to low-temperature convergence, and confirm that this reasoning approach reaches even flatter, more stable minima. Our findings suggest that more intentional control of internal reasoning dynamics may be an important mechanism for building models that can reliably navigate pluralistic environments as value complexity scales.
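As an illustration of the sharpness measure the abstract refers to, the following is a minimal sketch of estimating the top Hessian eigenvalue by power iteration on finite-difference Hessian-vector products. It is not the authors' implementation: the function names (`grad`, `hvp`, `top_hessian_eigenvalue`) and the two toy quadratic losses are assumptions chosen for illustration, and a real LLM analysis would typically use automatic differentiation rather than finite differences.

```python
import math
import random


def grad(loss, w, eps=1e-5):
    """Central-difference gradient of `loss` at point `w` (a list of floats)."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((loss(wp) - loss(wm)) / (2 * eps))
    return g


def hvp(loss, w, v, eps=1e-3):
    """Hessian-vector product H v, approximated by differencing gradients along v."""
    wp = [wi + eps * vi for wi, vi in zip(w, v)]
    wm = [wi - eps * vi for wi, vi in zip(w, v)]
    gp, gm = grad(loss, wp), grad(loss, wm)
    return [(a - b) / (2 * eps) for a, b in zip(gp, gm)]


def top_hessian_eigenvalue(loss, w, iters=100, seed=0):
    """Power iteration: returns the Hessian eigenvalue of largest magnitude at w."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        hv = hvp(loss, w, v)
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient v^T H v
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            break
        v = [x / norm for x in hv]
    return lam


# Hypothetical toy losses: a "cliff-like" (sharp) bowl and a flatter one.
def sharp_loss(w):
    return 0.5 * (50.0 * w[0] ** 2 + w[1] ** 2)  # Hessian = diag(50, 1)


def flat_loss(w):
    return 0.5 * (2.0 * w[0] ** 2 + w[1] ** 2)  # Hessian = diag(2, 1)


if __name__ == "__main__":
    w0 = [0.3, -0.2]
    print("sharp:", top_hessian_eigenvalue(sharp_loss, w0))
    print("flat: ", top_hessian_eigenvalue(flat_loss, w0))
```

In this toy setting the sharper bowl yields the larger top eigenvalue, which is the sense in which a lower top eigenvalue corresponds to a flatter landscape. Power iteration converges to the eigenvalue of largest magnitude, so near a saddle point the result may be a large negative eigenvalue rather than the largest positive one.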
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 137