On the Representation Degradation in Vision-Language-Action Models

Zhilong Zhang; Xiong-Hui Chen; Yidi Wang; Yihao Sun; Wenyu Luo; Haoxiang Ren; Haoxin Lin; Yang Yu

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: robot policy learning, vision-language-action models, representation learning

Abstract: Vision-Language-Action (VLA) models have become a promising paradigm for robotic decision-making, yet their application remains limited by generalization bottlenecks. In this paper, we conduct a layer-wise representation analysis and uncover a previously overlooked phenomenon of representation degradation: deeper layers tasked with action generation exhibit diminished generalization to both semantic information and environmental dynamics. To mitigate this issue, we introduce hidden Space WOrld modeLing (SWOL), a lightweight and efficient approach that aligns degraded deep-layer features with more generalizable mid-layer representations extrapolated from future observations. SWOL enforces temporally consistent, action-grounded representations without modifying the model architecture or inference procedure. Extensive experiments in simulation and real-world settings demonstrate that SWOL alleviates representation degradation, leading to improved policy effectiveness and stronger generalization across vision, language, and dynamics.
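
The abstract does not give the form of the alignment objective, so the following is only an illustrative sketch of what "aligning deep-layer features with mid-layer representations from future observations" could look like, not the authors' implementation. It assumes a cosine-distance loss and assumes both feature sets already share the same dimension; the names `cosine_similarity` and `alignment_loss` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(deep_feats, future_mid_feats):
    """Mean (1 - cosine similarity) over a batch of feature pairs.

    deep_feats:       deep-layer features at the current step (assumed input).
    future_mid_feats: mid-layer features of the future observation, treated
                      here as fixed targets (an assumption of this sketch).
    """
    losses = [
        1.0 - cosine_similarity(d, m)
        for d, m in zip(deep_feats, future_mid_feats)
    ]
    return sum(losses) / len(losses)

# Toy batch: the first pair is parallel, the second is orthogonal.
deep = [[1.0, 0.0], [1.0, 0.0]]
future_mid = [[3.0, 0.0], [0.0, 2.0]]
loss = alignment_loss(deep, future_mid)
```

In this sketch the loss is zero when each deep-layer feature points in the same direction as its target and grows as they diverge. Because such a term would only be added during training, it is consistent with the abstract's statement that the architecture and inference procedure stay unchanged.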

Supplementary Material: zip

Primary Area: applications to robotics, autonomy, planning

Submission Number: 18448
