Viewpoint-Invariant Latent Action Learning from Human Video Demonstrations

Published: 23 Sept 2025, Last Modified: 19 Nov 2025, SpaVLE Poster, CC BY 4.0
Keywords: Learning from Videos, Viewpoint-Invariant Latent Action
Abstract: Learning representations of visual transitions between consecutive frames in video enables robots to learn from both robot and human demonstrations. These representations, referred to as latent actions, capture the inherent state transition. However, continuously changing viewpoints in human videos introduce the ambiguity that hinders consistent modeling of latent actions. To address this issue, we propose \textbf{Vi}ew\textbf{P}oint-\textbf{I}nvariant \textbf{L}atent \textbf{A}ction, or ViPILA, a representation of visual transitions that is robust to the viewpoint variation from human videos without action labels and camera calibrations. Building on a theoretical analysis of viewpoint-invariance, we introduce a novel training objective that enforces consistency in latent actions across different viewpoints of the same state. The key idea is to enforce that a latent action inferred from one viewpoint can be used to reconstruct the observation from a different viewpoint, as long as the underlying state remains the same. We empirically demonstrate that the resulting viewpoint-invariant latent actions improve downstream manipulation policy learning in LIBERO simulation.
Submission Type: Long Research Paper (< 9 Pages)
Submission Number: 2