Finer Behavioral Foundation Models via Auto-Regressive Features and Advantage Weighting

Published: 09 May 2025, Last Modified: 28 May 2025 · RLC 2025 · CC BY 4.0
Keywords: Unsupervised RL, offline training, autoregressive features, successor measures
Abstract: The forward-backward representation (FB) is a recently proposed framework (Touati et al., 2023; Touati and Ollivier, 2021) to train behavior foundation models (BFMs) that aim to provide efficient zero-shot policies for any new task specified in a given reinforcement learning (RL) environment, without training for each new task. Here we address two core limitations of FB model training. First, FB, like all successor-feature-based methods, relies on a linear encoding of tasks: at test time, each new reward function is linearly projected onto a fixed set of pretrained features. This limits both the expressivity and the precision of the task representation. We break the linearity limitation by introducing auto-regressive features for FB, which let fine-grained task features depend on coarser-grained task information. This can represent arbitrary nonlinear task encodings, thus significantly increasing the expressivity of the FB framework. Second, it is well known that training RL agents from offline datasets often requires specific techniques. We show that FB works well together with such offline RL techniques, by adapting techniques from (Nair et al., 2020a; Cetin et al., 2024) to FB. This is necessary to avoid flatlining performance on some datasets, such as DMC Humanoid. As a result, we produce efficient FB BFMs for a number of new environments. Notably, on the D4RL locomotion benchmark, the generic FB agent matches the performance of standard single-task offline agents (IQL, XQL). In many setups, the offline techniques are needed to get any decent performance at all. The auto-regressive features have a positive but moderate impact, concentrated on tasks requiring spatial precision and task generalization beyond the behaviors represented in the training set. Together, these results establish that generic, reward-free FB BFMs can be competitive with single-task agents on standard benchmarks, while suggesting that expressivity of the BFM is not a key limiting factor in the environments tested.
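
For illustration, below is a minimal numpy sketch (not the authors' code) contrasting the standard linear task encoding used by FB/successor-feature methods with a hypothetical auto-regressive variant in the spirit described in the abstract. The feature dimension, the two-block split of the task vector, and the fine_features map are assumptions made for this example, not details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

d, n = 8, 1024                      # feature dimension, number of reward-labelled states (illustrative)
B = rng.normal(size=(n, d))         # pretrained backward features B(s) on dataset states
r = rng.normal(size=n)              # reward samples r(s) for the new task

# Standard FB / successor-feature task encoding: the task vector is the
# least-squares projection of the reward onto the span of the features,
# z = argmin_z E[(r(s) - B(s)^T z)^2].
cov = B.T @ B / n
z_linear = np.linalg.solve(cov + 1e-6 * np.eye(d), B.T @ r / n)

# Hypothetical auto-regressive encoding: a first (coarse) block of z is
# inferred linearly, then a second (fine) block is inferred from features
# that are allowed to depend on the coarse task signal.  fine_features is a
# placeholder nonlinearity, assumed here purely for illustration.
def fine_features(B_fine, coarse_pred):
    return np.tanh(B_fine * coarse_pred[:, None])

d0 = d // 2
B0, B1 = B[:, :d0], B[:, d0:]
z0 = np.linalg.solve(B0.T @ B0 / n + 1e-6 * np.eye(d0), B0.T @ r / n)
coarse_pred = B0 @ z0                              # coarse task signal per state
residual = r - coarse_pred
Phi1 = fine_features(B1, coarse_pred)              # fine features condition on coarse info
z1 = np.linalg.solve(Phi1.T @ Phi1 / n + 1e-6 * np.eye(d - d0), Phi1.T @ residual / n)
z_autoreg = np.concatenate([z0, z1])

Because Phi1 is a nonlinear function of the reward-dependent coarse prediction, the composite encoding (z0, z1) can capture reward structure that the purely linear projection z_linear cannot; this is the kind of added expressivity the abstract attributes to auto-regressive features.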
Submission Number: 181