Keywords: Reinforcement learning (RL), Large language models (LLMs), Memory Graph, LLM-derived priors, Sample Efficiency, Sparse-Reward Environments
TL;DR: MIRA integrates LLM guidance into RL through a memory graph and utility-shaped advantages, achieving faster learning with far fewer queries.
Abstract: Reinforcement learning (RL) agents often face high sample complexity in sparse- or delayed-reward settings due to limited prior knowledge.
In contrast, large language models (LLMs) can provide subgoal structures, plausible trajectories, and abstract priors that support early learning.
Yet heavy reliance on LLMs introduces scalability issues and risks dependence on unreliable signals, motivating ongoing efforts to integrate LLM guidance without compromising RL’s autonomy.
We propose MIRA (\underline{M}emory-\underline{I}ntegrated \underline{R}einforcement Learning \underline{A}gent), which augments learning with a structured and evolving \textit{memory graph}.
This graph stores decision-relevant information, such as trajectory segments and subgoal decompositions, and is co-constructed from the agent’s high-return experiences and LLM outputs.
From this structure, we derive a \textit{utility} signal that integrates with advantage estimation to refine policy updates without overriding the reward signal. By storing LLM-derived priors in memory rather than relying on continual queries, MIRA reduces its dependence on real-time supervision.
As training progresses, the agent’s policy outgrows the initial LLM-derived priors, and the utility term decays, leaving long-term convergence guarantees intact.
We establish theoretical guarantees that this utility-based shaping improves early-stage learning in sparse-reward settings.
Empirically, MIRA outperforms RL baselines and achieves final returns comparable to approaches that depend on frequent LLM supervision, while requiring substantially fewer online LLM queries.
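To make the utility-shaped advantage described in the abstract concrete, the following is a minimal Python sketch under stated assumptions: the `MemoryGraph` class, its `utility` lookup, the GAE routine, and the decaying coefficient `beta` are hypothetical illustrations of blending a memory-derived utility term with advantage estimation, not the authors' implementation.

```python
# Hypothetical sketch: utility-shaped advantages with a decaying shaping weight.
# All names (MemoryGraph, utility, beta schedule) are illustrative assumptions,
# not code from the paper.
import numpy as np


class MemoryGraph:
    """Toy memory structure: (state, action) edges carry utility estimates
    aggregated from high-return trajectory segments and LLM-suggested subgoals."""

    def __init__(self):
        self.edge_utility = {}  # (state, action) -> running utility estimate

    def update(self, trajectory, episode_return, lr=0.1):
        # Reinforce edges along high-return trajectories.
        for state, action in trajectory:
            old = self.edge_utility.get((state, action), 0.0)
            self.edge_utility[(state, action)] = old + lr * (episode_return - old)

    def utility(self, state, action):
        return self.edge_utility.get((state, action), 0.0)


def gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard generalized advantage estimation."""
    adv = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv


def shaped_advantages(rewards, values, states, actions, memory, step,
                      beta0=1.0, decay=1e-4):
    """Blend GAE with a memory-derived utility term whose weight decays over
    training, so the environment reward dominates asymptotically."""
    beta = beta0 / (1.0 + decay * step)  # decaying shaping coefficient
    base = gae(rewards, values)
    util = np.array([memory.utility(s, a) for s, a in zip(states, actions)])
    return base + beta * util
```

The decaying `beta` in this sketch is one simple way to realize the abstract's claim that the utility term fades as the policy outgrows the initial LLM-derived priors, leaving the underlying reward-driven updates (and their convergence behavior) unchanged in the limit.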
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 14351