GateFlow: Mitigating Shortcut Learning in VLA Models via Gated Flow Matching

15 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Vision-Language-Action, Flow Matching, Foundation Models, Embodied AI, Robotics
TL;DR: VLA models can easily learn shortcuts by exploiting spurious correlations. GateFlow uses transport distance to detect and suppress shortcuts while enhancing genuine semantic understanding.
Abstract: Vision-Language-Action (VLA) models promise general-purpose robotic intelligence by leveraging pretrained vision-language representations. However, these models suffer from shortcut learning: they exploit spurious correlations between visual patterns and actions rather than developing semantic understanding. This occurs because VLA models optimize an Evidence Lower Bound (ELBO) proxy instead of the true likelihood, creating an optimization gap that allows memorized patterns to masquerade as genuine solutions. To mitigate this problem, we introduce GateFlow, a transport-guided gating mechanism that detects and suppresses shortcut learning by measuring the Wasserstein distance between observation and action representations. Low transport distance indicates semantic understanding and triggers enhancement, while high distance signals a shortcut and triggers suppression. This selective gating closes the gap between the ELBO and the negative log-likelihood (NLL) by guiding optimization toward true likelihood minimization. We provide theoretical guarantees showing that GateFlow concentrates gradients on semantic features while eliminating spurious patterns. Empirically, GF-VLA achieves state-of-the-art performance across a range of tasks, with substantial improvements on long-horizon tasks and complex scenarios under non-stationary perturbations. GateFlow integrates seamlessly into existing VLA architectures with minimal computational overhead, offering a practical path toward more general robotic learning.
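The abstract describes the core mechanism: compute a transport distance between observation and action representations, then gate the training signal so that low-distance (semantically aligned) samples are enhanced and high-distance (likely shortcut) samples are suppressed. Below is a minimal sketch of this idea, not the authors' implementation: it assumes per-sample observation and action token features, approximates the Wasserstein distance with entropic (Sinkhorn) optimal transport, and applies a hypothetical exponential gate exp(-W/tau) to a flow-matching MSE loss. The names `sinkhorn_distance`, `gated_flow_matching_loss`, and the temperature `tau` are illustrative assumptions.

```python
import math
import torch

def sinkhorn_distance(x: torch.Tensor, y: torch.Tensor,
                      eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """Entropic-regularized OT distance between point clouds x (n, d) and y (m, d)."""
    cost = torch.cdist(x, y, p=2) ** 2                         # (n, m) squared-L2 cost
    n, m = cost.shape
    log_mu = torch.full((n,), -math.log(n), device=x.device)   # uniform source weights
    log_nu = torch.full((m,), -math.log(m), device=x.device)   # uniform target weights
    f = torch.zeros(n, device=x.device)                        # dual potentials
    g = torch.zeros(m, device=x.device)
    for _ in range(n_iters):                                   # log-domain Sinkhorn updates
        f = -eps * torch.logsumexp((g[None, :] - cost) / eps + log_nu[None, :], dim=1)
        g = -eps * torch.logsumexp((f[:, None] - cost) / eps + log_mu[:, None], dim=0)
    log_plan = (f[:, None] + g[None, :] - cost) / eps + log_mu[:, None] + log_nu[None, :]
    return (log_plan.exp() * cost).sum()                       # transport cost <P, C>

def gated_flow_matching_loss(v_pred, v_target, obs_feats, act_feats, tau=1.0):
    """Flow-matching MSE with a per-sample transport gate (hypothetical).

    obs_feats[b]: (n, d) observation tokens; act_feats[b]: (m, d) action tokens.
    Gate exp(-W/tau) is near 1 for semantically aligned samples, near 0 for shortcuts.
    """
    dists = torch.stack([sinkhorn_distance(o, a)
                         for o, a in zip(obs_feats, act_feats)])   # (B,) per-sample W
    gate = torch.exp(-dists.detach() / tau)                    # no gradient through the gate
    per_sample = ((v_pred - v_target) ** 2).flatten(1).mean(dim=1)
    return (gate * per_sample).sum() / gate.sum().clamp_min(1e-8)

# Toy usage: batch of 4 samples, 16 obs tokens / 8 action tokens of dim 32.
B, n, m, d = 4, 16, 8, 32
obs, act = torch.randn(B, n, d), torch.randn(B, m, d)
v_pred, v_target = torch.randn(B, m, d), torch.randn(B, m, d)
print(gated_flow_matching_loss(v_pred, v_target, obs, act).item())
```

In this sketch the distance is detached before gating, so the gate only reweights samples rather than letting the model shrink the transport distance adversarially; whether the actual method backpropagates through the gate is not specified in the abstract.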
Primary Area: applications to robotics, autonomy, planning
Submission Number: 6112