Keywords: LLM agents, online learning, in-context learning.
Abstract: Fine-tuning large language models (LLMs) using online learning, where models learn from self-sampled data and environmental feedback, presents a promising but challenging research direction due to the typically sparse nature of rewards. Traditional methods for addressing this challenge often involve training domain-specific Q-functions to convert sparse rewards into dense signals. However, these methods suffer from poor sample efficiency and limited generalizability.
In this work, we propose a novel framework that leverages the pre-trained knowledge of LLMs to transform sparse rewards into dense supervised signals through in-context learning. Specifically, we introduce a retrospective in-context learning approach, where LLMs assign temporal credit to past actions based on feedback. Unlike previous approaches, which rely heavily on extensive feedback data or intricate prompt engineering, our method uses online learning to iteratively update the policy by combining in-context learning with gradient-based fine-tuning.
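As a reading aid only (the abstract does not specify implementation details), the following is a minimal Python sketch of how the described loop could be organized: collect an episode under the current LLM policy, let a retrospective in-context call assign credit to past actions from the sparse terminal feedback, and then fine-tune the policy on the LLM-labeled pairs. All names here (`env`, `policy_llm`, `retrospective_llm`, `format_trajectory_prompt`) are hypothetical placeholders, not interfaces from the paper.

```python
# Hypothetical sketch of the retrospective in-context learning + fine-tuning loop.
# None of these objects or methods come from the paper; they only illustrate the flow.

def collect_episode(env, policy_llm):
    """Roll out the current policy; only a sparse terminal reward is observed."""
    trajectory, obs = [], env.reset()
    done = False
    while not done:
        action = policy_llm.act(obs)                # sample an action from the LLM policy
        next_obs, reward, done = env.step(action)   # reward is typically zero until the end
        trajectory.append((obs, action))
        obs = next_obs
    return trajectory, reward                       # sparse feedback for the whole episode


def retrospective_credit(retrospective_llm, trajectory, reward):
    """In-context credit assignment: the LLM judges which past actions contributed to the outcome."""
    prompt = format_trajectory_prompt(trajectory, reward)    # hypothetical prompt builder
    keep_flags = retrospective_llm.judge(prompt)             # e.g. one keep/discard flag per step
    return [pair for pair, keep in zip(trajectory, keep_flags) if keep]


def online_update(env, policy_llm, retrospective_llm, iterations=100):
    """Iteratively convert sparse episode feedback into dense supervision and fine-tune."""
    for _ in range(iterations):
        traj, reward = collect_episode(env, policy_llm)
        dense_data = retrospective_credit(retrospective_llm, traj, reward)
        policy_llm.finetune(dense_data)              # gradient-based update on LLM-labeled pairs
```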
We empirically demonstrate the effectiveness of our approach on the BabyAI benchmark, showing that it is significantly more sample-efficient than traditional online reinforcement learning (RL) algorithms while achieving comparable performance to imitation learning. Our findings suggest that LLM-based agents can refine their policies using sparse feedback in an online manner, making them more adaptive to dynamic environments.
Submission Number: 144