More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration

10 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning, Large Language Models, Reasoning, Exploration, Multi-teacher Learning
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising paradigm for enhancing the reasoning ability of Large Language Models (LLMs). However, prevailing methods rely primarily on self-exploration or a single off-policy teacher to elicit long chain-of-thought (LongCoT) reasoning, which may introduce intrinsic model biases and restrict exploration, ultimately limiting reasoning diversity and performance. Grounded in the principle of the "Wisdom of Crowds" and drawing inspiration from multi-teacher strategies in knowledge distillation, we introduce **A**daptive **M**ulti-Guidance **P**olicy **O**ptimization (**AMPO**), a novel framework that adaptively leverages guidance from multiple proficient teacher models, but only when the on-policy model fails to generate correct solutions. This "guidance-on-demand" approach expands exploration while preserving the value of self-discovery. Moreover, AMPO incorporates a comprehension-based selection mechanism, prompting the student to learn from the reasoning paths it is most likely to comprehend, thus balancing broad exploration with effective exploitation. Extensive experiments show that AMPO substantially outperforms a strong baseline (GRPO), with a **4.3%** improvement on mathematical reasoning tasks and **10.9%** on out-of-distribution tasks, while significantly boosting Pass@k performance and enabling more diverse exploration. Notably, using four peer-sized teachers, our method outperforms a single, more powerful teacher (*e.g.*, DeepSeek-R1) under the same data budget and achieves comparable results to approaches trained with significantly more data. These results demonstrate a more efficient and scalable path to superior reasoning and generalizability. *Our code is available at [https://anonymous.4open.science/r/7fBQd46C](https://anonymous.4open.science/r/7fBQd46C), which will be made public after double-blind review*.
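To make the "guidance-on-demand" and comprehension-based selection ideas concrete, here is a minimal sketch of how such a rollout-selection step might look. This is not AMPO's actual implementation; the `Trace` type, the character-level `mean_logprob` stand-in for the student's per-token log-likelihood, and all names are hypothetical illustrations of the mechanism the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    text: str       # the reasoning path (chain of thought)
    correct: bool   # verifiable-reward outcome for this trace

def mean_logprob(student_scores, text):
    # Hypothetical stand-in for the student's average per-token log-prob
    # over a trace; here a toy per-character lookup table.
    return sum(student_scores.get(ch, -5.0) for ch in text) / max(len(text), 1)

def select_training_traces(rollouts, teacher_traces, student_scores):
    """Guidance-on-demand: teacher traces are consulted only when every
    on-policy rollout fails; among correct teacher traces, keep the one
    the student comprehends best (highest mean log-prob under the student)."""
    if any(r.correct for r in rollouts):
        return rollouts  # self-discovery succeeded; no guidance needed
    correct = [t for t in teacher_traces if t.correct]
    if not correct:
        return rollouts  # no usable guidance this round
    best = max(correct, key=lambda t: mean_logprob(student_scores, t.text))
    return rollouts + [best]
```

In this sketch, a correct on-policy rollout short-circuits the teacher query entirely, which is what preserves the value of self-exploration; the comprehension score only arbitrates among teachers when the student has failed on its own.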
Primary Area: reinforcement learning
Submission Number: 3689