Recursive Introspection: Teaching LLM Agents How to Self-Improve

Published: 17 Jun 2024, Last Modified: 19 Jul 2024 · 2nd SPIGM @ ICML Poster · CC BY 4.0
Keywords: Large Language Model, Reinforcement Learning, Self-Improvement
TL;DR: This paper presents RISE, a fine-tuning approach that enables language models to iteratively improve their own responses over multiple turns.
Abstract: A central piece in enabling intelligent agentic behavior in foundation models is making them capable of introspecting on their own behavior, reasoning about it, and correcting their mistakes. However, even powerful proprietary large language models (LLMs) largely lack the ability to sequentially improve their responses, even when explicitly informed about their mistakes. In this paper, we develop $\textbf{RISE}$: $\textbf{R}$ecursive $\textbf{I}$ntro$\textbf{S}$p$\textbf{E}$ction, an approach for fine-tuning LLMs to introduce this ability. Our approach prescribes an iterative fine-tuning procedure that teaches the model to alter its response after seeing previously unsuccessful attempts to solve a problem, together with additional environment feedback. $\textbf{RISE}$ poses fine-tuning for a single-turn problem as solving a multi-turn Markov decision process (MDP), where the initial state is the prompt. Inspired by principles in online imitation learning, we derive effective strategies for multi-turn data collection and training that imbue an LLM with the capability to recursively detect and correct its previous mistakes in subsequent iterations. Our experiments show that $\textbf{RISE}$ enables 7B Llama2 and Mistral models to improve their own responses over more turns on math reasoning tasks, outperforming several single-turn strategies given an equal amount of inference-time computation. Our analysis shows that $\textbf{RISE}$ makes meaningful improvements to responses on challenging prompts, arriving at the correct solution without degrading one-turn performance.
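
The multi-turn data-collection loop described in the abstract can be pictured roughly as follows. This is a minimal sketch, not the authors' implementation: the model interface (`generate`), the answer checker (`is_correct`), and the source of improved supervision targets (`improve`, e.g. best-of-N sampling from the learner or a stronger model) are hypothetical callables assumed for illustration.

```python
# Minimal sketch of a RISE-style multi-turn data-collection rollout.
# `generate`, `is_correct`, and `improve` are hypothetical callables
# supplied by the caller; they are not part of the paper's code.

FEEDBACK = "Your previous answer was incorrect. Please try again."

def collect_trajectory(generate, is_correct, improve, prompt, answer, max_turns=3):
    """Roll out a model for several turns on one problem, asking it to revise
    after each failed attempt, and return (multi-turn context, improved
    response) pairs usable as supervised fine-tuning data."""
    messages = [{"role": "user", "content": prompt}]
    pairs = []

    for _ in range(max_turns):
        attempt = generate(messages)                 # model's current answer
        messages.append({"role": "assistant", "content": attempt})

        if is_correct(attempt, answer):              # environment feedback
            break

        # Supervision target for this turn: a better response conditioned on
        # the failed attempt, e.g. best-of-N from the learner or a stronger model.
        pairs.append((list(messages), improve(messages)))

        # Feed the failure back so the next turn conditions on the mistake.
        messages.append({"role": "user", "content": FEEDBACK})

    return pairs
```

Under these assumptions, fine-tuning would maximize the likelihood of each improved response given the full multi-turn context, so that at inference time the model can be queried for several turns and sequentially refine its own answer.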
Submission Number: 47