Keywords: Latent Action Models, VLA, Imitation Learning, VLM
TL;DR: We show that representations obtained by simply asking VLMs to ignore distractors can significantly improve the performance of Latent Action Models in the presence of distractors.
Abstract: Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs. Our results show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
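The sketch below is an illustrative (non-authoritative) outline of the idea described in the abstract: a promptable VLM is asked to ignore distractors, and the change in its prompt-conditioned representation between consecutive frames is used as the supervision target for the latent action model. The `vlm_delta` function, module sizes, and prompt text are assumptions for illustration, not the paper's actual interface.

```python
# Minimal sketch (assumptions, not the authors' code) of training a LAM
# against promptable VLM representation targets.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Encodes a pair of observations into a latent action and predicts
    the VLM's distractor-filtered representation change from it."""
    def __init__(self, obs_dim=512, latent_dim=16, target_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.target_head = nn.Linear(latent_dim, target_dim)

    def forward(self, obs_t, obs_t1):
        z = self.encoder(torch.cat([obs_t, obs_t1], dim=-1))
        return z, self.target_head(z)

def vlm_delta(obs_t, obs_t1, prompt):
    """Placeholder for a promptable VLM: in practice this would embed both
    frames conditioned on a prompt such as 'describe only the controllable
    robot motion, ignore distractors' and return the embedding difference."""
    return torch.randn(obs_t.shape[0], 256)  # stand-in target

# Toy training step: the LAM is supervised to explain the VLM's
# prompt-conditioned representation change rather than raw pixel change.
lam = LatentActionModel()
opt = torch.optim.Adam(lam.parameters(), lr=1e-4)
obs_t, obs_t1 = torch.randn(8, 512), torch.randn(8, 512)
target = vlm_delta(obs_t, obs_t1, prompt="ignore the distractors")
_, pred = lam(obs_t, obs_t1)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
opt.step()
```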
Lightning Talk Video: mp4
Submission Number: 26