Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
Keywords: LLM, sequential decision making
Abstract: Large language models (LLMs) are increasingly deployed as agents for decision-making (DM) in interactive and dynamic environments. Yet LLMs were not originally designed for sequential decision-making, and recent work shows that they often fail even in basic online DM settings. We propose Iterative Regret-Minimization Fine-Tuning (Iterative RMFT), a post-training framework that repeatedly distills low-regret decision trajectories into a base model. In contrast to prior approaches that distill known algorithms or impose manually crafted reasoning structures, our method uses regret directly as a learning signal to induce improved decision-making behavior, while allowing the model to generate natural-language reasoning organically. Empirically, Iterative RMFT yields consistent improvements in DM performance across a range of models, including numerical Transformers, lightweight open-weight LLMs, and the closed-weight model GPT-4o mini, and demonstrates robust generalization across horizons, action spaces, reward dynamics, and natural-language-specified DM tasks. Overall, we view this work as an initial step toward more principled and fundamentally new post-training paradigms for enabling effective decision-making in LLMs.
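Illustrative sketch (not from the paper): the abstract describes using regret as the signal for choosing which trajectories to distill. The Python below shows one way such a selection step could look, assuming a stochastic multi-armed bandit with known arm means and expected (pseudo-)regret as the score. The function names `cumulative_regret`, `select_low_regret`, and `sample_round`, the bandit setting, and the top-k selection rule are all hypothetical; the abstract does not specify the regret definition, the selection rule, or the fine-tuning procedure, and the fine-tuning step itself is omitted here.

```python
from typing import Callable, List, Sequence


def cumulative_regret(actions: Sequence[int], arm_means: Sequence[float]) -> float:
    """Expected regret of a trajectory: per-step gap to the best arm, summed.

    Assumption: a stochastic bandit whose true arm means are known to the
    data-collection process (not to the model being trained).
    """
    best = max(arm_means)
    return sum(best - arm_means[a] for a in actions)


def select_low_regret(
    trajectories: Sequence[Sequence[int]],
    arm_means: Sequence[float],
    keep: int,
) -> List[Sequence[int]]:
    """Return the `keep` trajectories with the lowest cumulative regret."""
    ranked = sorted(trajectories, key=lambda t: cumulative_regret(t, arm_means))
    return ranked[:keep]


def sample_round(
    sample_trajectory: Callable[[], Sequence[int]],
    arm_means: Sequence[float],
    num_samples: int,
    keep: int,
) -> List[Sequence[int]]:
    """One hypothetical round: sample from the current model, keep the best.

    `sample_trajectory` stands in for rolling out the current model; in an
    iterative scheme, the kept trajectories would be used as fine-tuning data
    and the updated model would be sampled again in the next round.
    """
    candidates = [sample_trajectory() for _ in range(num_samples)]
    return select_low_regret(candidates, arm_means, keep)
```

For example, with arm means `[0.2, 0.8]`, the trajectory `[1, 1]` incurs zero regret and would be preferred over `[0, 0]` or `[0, 1]` as distillation data.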
Submission Number: 41