The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
TL;DR: We train LLM agents as language-conditioned policies without requiring expensive labeled data or online experimentation. The framework leverages LLMs to enable the use of unlabeled datasets and to improve generalization to unseen goals and states.
Abstract: Developing autonomous agents capable of performing complex, multi-step decision-making tasks specified in natural language remains a significant challenge, particularly in realistic settings where labeled data is scarce and real-time experimentation is impractical. Existing reinforcement learning (RL) approaches often struggle to generalize to unseen goals and states, limiting their applicability. In this paper, we introduce $\textit{TEDUO}$, a novel training pipeline for offline language-conditioned policy learning in symbolic environments. Unlike conventional methods, $\textit{TEDUO}$ operates on readily available, unlabeled datasets and addresses the challenge of generalization to previously unseen goals and states. Our approach harnesses large language models (LLMs) in a dual capacity: first, as automation tools that augment offline datasets with richer annotations, and second, as generalizable instruction-following agents. Empirical results demonstrate that $\textit{TEDUO}$ achieves data-efficient learning of robust language-conditioned policies, accomplishing tasks beyond the reach of conventional RL frameworks or out-of-the-box LLMs alone.
Lay Summary: Training AI agents to follow natural language instructions and complete complex tasks remains a major challenge, especially when we cannot rely on large amounts of labeled data or on real-time trial-and-error exploration. Many current reinforcement learning (RL) methods struggle to handle new goals or previously unseen states, making them hard to apply in the real world. In this paper, we present TEDUO, a new training pipeline that teaches agents to follow language instructions using only pre-recorded, unlabeled datasets. TEDUO employs large language models (LLMs) in two ways: first, to help label and enrich offline data, and second, to act as agents that can interpret and carry out instructions. This combination helps the system learn more efficiently and generalize better to new tasks. Our results show that TEDUO can solve tasks that standalone offline RL methods or out-of-the-box LLMs alone cannot handle, offering a more practical path toward building capable, instruction-following agents.
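To make the dual role of the LLMs described above concrete, the following is a minimal, hypothetical sketch of the two stages; it is not the authors' implementation (see the linked repository for that), and the names `Transition`, `annotate_with_goal`, `act`, and the generic `llm` callable are illustrative assumptions.

```python
# Hypothetical sketch of the two LLM roles described in the abstract.
# Stage 1: the LLM as an automation tool that hindsight-labels unlabeled trajectories.
# Stage 2: the LLM as a language-conditioned (instruction-following) policy.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Transition:
    state: str       # symbolic state description
    action: str      # action taken by the behavior policy
    next_state: str  # resulting symbolic state

LLM = Callable[[str], str]  # any text-in / text-out model

def annotate_with_goal(trajectory: List[Transition], llm: LLM) -> str:
    """Stage 1 (assumed): label an unlabeled trajectory with the natural-language
    goal it appears to accomplish, turning raw logs into goal-annotated data."""
    state_sequence = " -> ".join(t.next_state for t in trajectory)
    prompt = (
        "An agent produced the following sequence of symbolic states:\n"
        f"{state_sequence}\n"
        "In one short sentence, state the goal this trajectory accomplishes."
    )
    return llm(prompt)

def act(goal: str, state: str, llm: LLM) -> str:
    """Stage 2 (assumed): query the LLM as a goal-conditioned policy that maps
    (language goal, current state) to the next action."""
    prompt = (
        f"Goal: {goal}\n"
        f"Current state: {state}\n"
        "Reply with the single best next action."
    )
    return llm(prompt)
```

In the actual pipeline, the stage-1 annotations would be used to train the agent offline rather than queried at decision time as above; the sketch only illustrates how the two LLM roles divide the work between data labeling and instruction following.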
Link To Code: https://github.com/TPouplin/TEDUO/
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Goal-conditioned policy, Offline policy learning, Decision-making agent, Goal generalization, Domain generalization
Submission Number: 11508