TL;DR: A method for training on synthetic data to improve LLMs' sequential decision-making capabilities
Abstract: Efficient exploration is essential for intelligent systems interacting with their environment, but existing language models often fall short in scenarios that require strategic information gathering. In this paper, we present **Paprika**, a fine-tuning approach that enables language models to develop general decision-making capabilities that are not confined to particular environments. By training on synthetic interaction data from different tasks that require diverse strategies, Paprika teaches models to explore and adapt their behavior on a new task in-context, based on environment feedback, without further gradient updates. Experimental results show that models fine-tuned with Paprika can effectively transfer their learned decision-making capabilities to entirely unseen tasks without additional training. Unlike traditional training, our approach's primary bottleneck lies in sampling useful interaction data rather than in model updates. To improve sample efficiency, we propose a curriculum learning strategy that prioritizes sampling trajectories from tasks with high learning potential. These results suggest a promising path towards AI systems that can autonomously solve novel sequential decision-making problems that require interaction with the external world.
Lay Summary: Many AI agents struggle to gather information efficiently when they face new environments, because they lack general strategies for exploration and trying new things.
Our method, **Paprika**, addresses this through a two-stage fine-tuning process: we first let a language model “play” in many simulated tasks to collect diverse examples of trial and error, then use a preference-based scoring method to teach the model which experiences led to success. Since the model sees many different tasks at the same time, it is incentivized to learn a general problem-solving strategy rather than memorizing each task individually. To make this experience gathering more efficient, Paprika orders these tasks by their “learning potential”—that is, how much each one teaches the model about intelligent and efficient exploration—and focuses on the most informative tasks first. This procedure is similar to how human students learn from a curriculum of increasing difficulty relative to the student’s knowledge.
In experiments, models fine-tuned with Paprika transfer these decision-making skills to entirely unseen tasks with no extra training. By shifting the main challenge from costly model updates to smart data selection, Paprika paves the way for AI systems that can autonomously tackle novel, sequential decision-making problems with minimal human intervention.
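The summary above does not spell out how "learning potential" is measured. As a purely hypothetical sketch (the function names and the variance-based proxy below are our illustration, not the paper's actual criterion), one simple way to prioritize tasks is to favor those the model sometimes, but not always, solves, since their success rate carries the most information:

```python
import random

def learning_potential(success_rates):
    # Bernoulli variance p * (1 - p): highest for tasks solved about half
    # the time, zero for tasks that are always solved or always failed.
    return [p * (1 - p) for p in success_rates]

def sample_task(task_ids, success_rates, rng=random):
    # Sample a task to roll out next, weighted by estimated learning potential.
    weights = learning_potential(success_rates)
    return rng.choices(task_ids, weights=weights, k=1)[0]

# Tasks that are trivially easy (p=1.0) or currently impossible (p=0.0)
# are never sampled; the partially solved task is preferred.
chosen = sample_task(["easy", "medium", "hard"], [1.0, 0.5, 0.0])
```

Under this proxy, the curriculum automatically shifts as training progresses: once a task is reliably solved, its weight decays to zero and sampling moves on to harder tasks.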
Link To Code: https://github.com/tajwarfahim/paprika
Primary Area: Deep Learning->Large Language Models
Keywords: LLM Agent, Synthetic Data, Multi-turn Fine-tuning
Submission Number: 13556