I²B-LPO: Latent Policy Optimization via Iterative Information Bottleneck

ACL ARR 2026 January Submission726 Authors

24 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Reinforcement Learning, Large Language Model, Entropy, Information Bottleneck, Latent Space Reasoning
Abstract: Despite recent advances in reinforcement learning with verifiable rewards (RLVR) for large language model (LLM) reasoning, most methods suffer from exploration collapse: the semantic homogeneity of random rollouts traps models in narrow, over-optimized behaviors. Existing methods leverage policy entropy to encourage exploration but face inherent limitations: global entropy regularization is susceptible to reward hacking, inducing meaningless verbosity, whereas local token-selective updates struggle against the strong inductive bias of pre-trained models. To this end, we propose Latent Policy Optimization via Iterative Information Bottleneck (I²B-LPO), which shifts from statistical perturbation of token distributions to topological branching of reasoning trajectories. I²B-LPO triggers latent branching at high-entropy states to diversify reasoning trajectories and applies the Information Bottleneck as a trajectory filter and self-reward, ensuring exploration that is both concise and informative. Empirical results on four mathematical benchmarks demonstrate that I²B-LPO achieves state-of-the-art performance, with margins of up to 5.3% in accuracy and 7.4% in diversity metrics. Code is available at https://anonymous.4open.science/r/Anonymous-code-2CDB/.
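The two mechanisms named in the abstract, entropy-triggered branching and Information-Bottleneck-style trajectory filtering, can be illustrated with a minimal sketch. This is not the authors' implementation; the threshold, the `beta` trade-off coefficient, and all function names below are hypothetical placeholders for how such components are commonly structured.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    z = logits - logits.max()          # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def should_branch(logits, threshold=1.5):
    """Trigger a latent branch when the policy is uncertain
    (high entropy), i.e., at a promising exploration point.
    The threshold value here is illustrative, not from the paper."""
    return token_entropy(logits) > threshold

def ib_score(reward, traj_len, beta=0.01):
    """IB-flavored trajectory score: favor informative (high-reward)
    trajectories while penalizing length as a crude compression term."""
    return reward - beta * traj_len

def filter_trajectories(trajs, top_k=2):
    """trajs: list of (reward, length) pairs; keep the top_k
    trajectories under the IB-style score."""
    return sorted(trajs, key=lambda t: ib_score(*t), reverse=True)[:top_k]
```

For example, a uniform next-token distribution (maximal uncertainty) triggers branching, while a sharply peaked one does not; among candidate trajectories, a short high-reward rollout outranks a long one with slightly higher raw reward.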
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Reinforcement Learning, Large Language Model Reasoning, Entropy, Information Bottleneck, Latent Space Reasoning
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 726