Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration

Jingtong Gao; Ling Pan; Yejing Wang; Rui Zhong; Chi Lu; Qingpeng Cai; Peng Jiang; Xiangyu Zhao

Navigate the Unknown: Enhancing LLM Reasoning with Intrinsic Motivation Guided Exploration

Jingtong Gao, Ling Pan, Yejing Wang, Rui Zhong, Chi Lu, Qingpeng Cai, Peng Jiang, Xiangyu Zhao

19 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Model, Reasoning, Reinforcement Learning, Exploration

TL;DR: Existing RL methods for LLMs struggle with sparse rewards and poor exploration. i-MENTOR uses intrinsic motivation guided exploration to provide dense exploration rewards and enhance performance across datasets.

Abstract: Reinforcement Learning (RL) has emerged as a pivotal method for improving the reasoning capabilities of Large Language Models (LLMs). However, prevalent RL approaches such as Proximal Policy Optimization (PPO) and Group-Relative Policy Optimization (GRPO) face critical limitations due to their reliance on sparse outcome-based rewards and inadequate mechanisms for incentivizing exploration. These limitations result in inefficient guidance for reasoning. Specifically, sparse rewards fail to deliver sufficient feedback, particularly for challenging problems. Furthermore, such rewards induce systematic biases that prioritize exploitation of familiar trajectories over novel solution discovery. These shortcomings critically hinder performance in complex reasoning tasks, which inherently demand iterative refinement across intermediate steps. To address these challenges, we propose an Intrinsic Motivation guidEd exploratioN meThOd foR LLM Reasoning (i-MENTOR), a method designed to deliver dense rewards and amplify exploration in the RL-based paradigm. i-MENTOR introduces three innovations: trajectory-aware exploration rewards that mitigate bias in token-level strategies while maintaining computational efficiency; error-conditioned reward allocation to ensure efficient exploration on challenging samples while intrinsically stabilizing training; and advantage-preserving integration that maintains advantage distribution integrity while incorporating exploratory guidance. Experiments across 4 public datasets demonstrate i-MENTOR's effectiveness, achieving a 22.23\% improvement on AIME 2024.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 15472

Loading