SEAL: Skill-Embedded Action for Hierarchical Vision–Language–Action Models

Shih-Min Yang; Martin Magnusson; Johannes A. Stork; Todor Stoyanov

19 Sept 2025 (modified: 25 Sept 2025), ICLR 2026 Conference Withdrawn Submission, CC BY 4.0

Keywords: Vision–Language–Action Models, Robotic Manipulation, Hierarchical Learning

TL;DR: SEAL is a hierarchical VLA model with a skill-embedding bottleneck that reduces VLM usage and error accumulation for efficient, low-latency, long-horizon execution.

Abstract: Vision–Language–Action (VLA) models are promising candidates for robotic foundation models, but monolithic designs that decode actions with a large vision–language model (VLM) at every timestep are computationally costly and require extensive demonstration data. We propose SEAL, a hierarchical VLA model that separates long-horizon reasoning from stepwise control: a high-level module outputs a compact skill embedding at a lower temporal frequency, and a lightweight low-level policy executes closed-loop actions at every step. SEAL thus replaces the standard vision/language $\rightarrow$ action paradigm with a vision/language $\rightarrow$ skill $\rightarrow$ action hierarchy. The skill embedding forms a bottleneck that provides temporal abstraction and improves learning efficiency. We stabilize this representation with variance–covariance regularization to prevent collapse, and apply a contrastive loss so that different skills guide different behaviors. Because the VLM is queried less often, per-step latency drops without degrading task success, and the explicit, regularized bottleneck reduces error accumulation. On the LIBERO benchmark, SEAL achieves faster inference and improved data efficiency while maintaining or exceeding success rates, demonstrating that hierarchical VLA models with regularized skill bottlenecks are a scalable path toward robotic foundation models.
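The two-rate control scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `skill_interval`, and the toy stand-ins for the high-level module, the low-level policy, and the environment are all assumptions made for illustration. The abstract only states that the skill embedding is produced at a lower temporal frequency than the per-step actions.

```python
from typing import Callable, Dict, List

Vector = List[float]


def run_hierarchical_episode(
    high_level: Callable[[Vector, str], Vector],
    low_level: Callable[[Vector, Vector], Vector],
    step_env: Callable[[Vector, Vector], Vector],
    initial_obs: Vector,
    instruction: str,
    num_steps: int,
    skill_interval: int,
) -> Dict[str, object]:
    """Roll out a vision/language -> skill -> action hierarchy.

    The (expensive) high-level module is queried once every
    `skill_interval` steps; the (cheap) low-level policy acts at
    every step, conditioned on the most recent skill embedding.
    A fixed interval is an assumption of this sketch.
    """
    obs = initial_obs
    skill: Vector = []
    high_level_calls = 0
    actions: List[Vector] = []

    for t in range(num_steps):
        if t % skill_interval == 0:
            # Refresh the skill embedding at the lower frequency.
            skill = high_level(obs, instruction)
            high_level_calls += 1
        # Closed-loop control: the action depends on the current observation.
        action = low_level(obs, skill)
        actions.append(action)
        obs = step_env(obs, action)

    return {
        "actions": actions,
        "high_level_calls": high_level_calls,
        "final_obs": obs,
    }


# Hypothetical toy stand-ins, used only to make the sketch runnable.
def toy_high_level(obs: Vector, instruction: str) -> Vector:
    # "Skill" = offset from the observation to a fixed goal of 1.0 per dimension.
    return [1.0 - x for x in obs]


def toy_low_level(obs: Vector, skill: Vector) -> Vector:
    # Take a small step in the direction encoded by the skill.
    return [0.1 * s for s in skill]


def toy_env(obs: Vector, action: Vector) -> Vector:
    # Trivial dynamics: the action is added to the state.
    return [x + a for x, a in zip(obs, action)]


result = run_hierarchical_episode(
    toy_high_level, toy_low_level, toy_env,
    initial_obs=[0.0, 0.0],
    instruction="put the bowl on the plate",
    num_steps=20,
    skill_interval=5,
)
print(result["high_level_calls"], len(result["actions"]))
```

In this toy rollout the high-level module runs once every five steps while the low-level policy runs at every step, which is the source of the per-step latency reduction the abstract describes.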

Primary Area: foundation or frontier models, including LLMs

Code Of Ethics: true

Submission Guidelines: true

Anonymous Url: true

No Acknowledgement Section: true

Submission Number: 21590
