Geometry of Nash Mirror Dynamics: Adaptive $\beta$-Control for Stable and Bias-Robust Self-Improving LLM Agents
Keywords: Large Language Models, Learning in Games
Abstract: Self‑improving agents learn by playing competitive, often non-transitive language games (e.g., generator–solver, proposer–verifier) where training can oscillate or drift toward undesirable behaviours. We study this scenario through the lens of reverse‑KL regularised Nash learning, showing how the regularisation strength $\beta$ shapes both where agents converge and how they get there. We derive a continuous‑time view of Nash Mirror Descent (Nash‑MD), revealing a simple geometry: trajectories are spirals on the simplex whose damping grows with $\beta$, while $\beta$ simultaneously pulls equilibria toward the reference policy—amplifying any existing biases. We prove last‑iterate convergence to the $\beta$‑regularised Nash equilibrium, quantify its first‑order shift from the unregularised solution, and link convergence speed to the spectrum of the linearised dynamics.
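To make the geometry concrete, here is a minimal NumPy sketch (not from the paper) that simulates a simple Euler discretisation of reverse-KL-regularised mirror-descent self-play on Rock-Paper-Scissors; the update rule, step size, initial policy, and reference policies are illustrative assumptions rather than the paper's exact Nash-MD scheme, but they exhibit the two effects described above: stronger damping of the spiral as $\beta$ grows, and a $\beta$-dependent shift of the regularised equilibrium toward a biased reference policy.

```python
import numpy as np

# Rock-Paper-Scissors payoff for the row player (zero-sum): A[i, j] = payoff of action i vs action j.
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run(beta, pi_ref, eta=0.05, steps=1000):
    """Euler steps of the regularised dynamics in symmetric self-play:
    dz/dt = A x - beta * (log x - log pi_ref),  with  x = softmax(z)."""
    z = np.log(np.array([0.7, 0.2, 0.1]))   # start from a deliberately biased policy
    log_ref = np.log(pi_ref)
    for _ in range(steps):
        x = softmax(z)
        z = z + eta * (A @ x - beta * (np.log(x) - log_ref))
    return softmax(z)

uniform = np.ones(3) / 3
biased_ref = np.array([0.5, 0.3, 0.2])
for beta in (0.05, 0.5, 2.0):
    # Uniform reference: the regularised equilibrium is the Nash (uniform), and larger
    # beta damps the spiral faster, so the residual distance after 1000 steps shrinks.
    x_fast = run(beta, uniform, steps=1000)
    # Biased reference: larger beta drags the regularised equilibrium toward pi_ref,
    # i.e. further away from the unregularised Nash equilibrium (uniform).
    x_shift = run(beta, biased_ref, steps=20000)
    print(f"beta={beta:4.2f}  residual (uniform ref): {np.abs(x_fast - uniform).max():.4f}"
          f"  equilibrium shift (biased ref): {np.abs(x_shift - uniform).max():.4f}")
```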
Building on this geometry, we introduce two adaptive $\beta$ controllers: (i) a Hessian‑based rule that targets a desired damping–rotation ratio to accelerate without overshoot, and (ii) a bias‑based rule that caps measurable bias (e.g., output length, calibration, hallucination proxies) while retaining speed. On toy games (e.g., Rock–Paper–Scissors) and small open‑model reasoning benchmarks, our controllers deliver faster, more stable convergence with bounded bias, outperforming baselines. The result is a practical recipe: tune $\beta$ as a control knob to make self‑improving LLM agents both faster and safer.
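As a rough illustration of the second rule's shape only, the following hypothetical controller (the function name, multiplicative update, gain, and bias proxy are all assumptions, not the paper's rule) keeps $\beta$ as large as a bias budget allows: it shrinks $\beta$ when a measured bias proxy exceeds its cap and grows it back otherwise.

```python
import numpy as np

def update_beta(beta, measured_bias, bias_cap,
                gain=0.05, beta_min=1e-3, beta_max=10.0):
    """One step of a hypothetical bias-capping beta controller (illustrative only).
    Larger beta means stronger damping (faster, more stable convergence) but also a
    stronger pull toward the reference policy and hence more inherited bias, so keep
    beta as large as the bias budget allows: shrink it multiplicatively when the
    measured bias proxy exceeds its cap, and grow it back otherwise."""
    ratio = measured_bias / bias_cap
    beta = beta * np.exp(-gain * (ratio - 1.0))   # ratio > 1 shrinks beta, ratio < 1 grows it
    return float(np.clip(beta, beta_min, beta_max))

# Toy usage: a bias proxy (e.g. mean output-length drift relative to the reference),
# measured once per evaluation round against a cap of 1.0.
beta = 1.0
for bias in (0.5, 0.8, 1.1, 1.4, 1.2, 0.9):
    beta = update_beta(beta, bias, bias_cap=1.0)
    print(f"measured bias = {bias:.1f}  ->  beta = {beta:.3f}")
```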
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25046