Keywords: Latent Action Models, VLA, Imitation Learning, VLM
TL;DR: We show that representations obtained by simply asking VLMs to ignore distractors can significantly improve the performance of Latent Action Models in the presence of distractors.
Abstract: Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs. Our results show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.
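The sketch below is an illustrative (non-authoritative) outline of the idea described in the abstract: a promptable VLM is asked to ignore distractors, and the change in its prompt-conditioned representation between consecutive frames is used as the supervision target for the latent action model. The `vlm_delta` function, module sizes, and prompt text are assumptions for illustration, not the paper's actual interface.

```python
# Minimal sketch (assumptions, not the authors' code) of training a LAM
# against promptable VLM representation targets.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Encodes a pair of observations into a latent action and predicts
    the VLM's distractor-filtered representation change from it."""
    def __init__(self, obs_dim=512, latent_dim=16, target_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2 * obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        self.target_head = nn.Linear(latent_dim, target_dim)

    def forward(self, obs_t, obs_t1):
        z = self.encoder(torch.cat([obs_t, obs_t1], dim=-1))
        return z, self.target_head(z)

def vlm_delta(obs_t, obs_t1, prompt):
    """Placeholder for a promptable VLM: in practice this would embed both
    frames conditioned on a prompt such as 'describe only the controllable
    robot motion, ignore distractors' and return the embedding difference."""
    return torch.randn(obs_t.shape[0], 256)  # stand-in target

# Toy training step: the LAM is supervised to explain the VLM's
# prompt-conditioned representation change rather than raw pixel change.
lam = LatentActionModel()
opt = torch.optim.Adam(lam.parameters(), lr=1e-4)
obs_t, obs_t1 = torch.randn(8, 512), torch.randn(8, 512)
target = vlm_delta(obs_t, obs_t1, prompt="ignore the distractors")
_, pred = lam(obs_t, obs_t1)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
opt.step()
```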
Lightning Talk Video: mp4
Submission Number: 26