Coordinated Cross-Modal Token Reuse with a Unified Mask for Efficient and Accurate Inference in VLA Models
Keywords: Vision Language Models, Vision Language Action Models, Efficient Inference
Abstract: Vision-language-action (VLA) inference suffers from substantial redundancy across consecutive frames, where large regions of the visual input remain unchanged while the model repeatedly re-encodes the entire scene. We propose a training-free approach for efficient and accurate VLA inference based on \emph{coordinated cross-modal token reuse}. Our method introduces a \emph{unified mask} that identifies static and task-irrelevant visual patches using a two-stage criterion combining temporal appearance consistency and attention-based relevance. The unified mask drives reuse consistently in both the vision encoder and the language model: cached visual representations are reused for selected patches, while dynamic or task-critical regions are recomputed. This coordinated reuse preserves cross-modal consistency and enables end-to-end acceleration without modifying model architecture or requiring finetuning. Experiments on robotic manipulation tasks demonstrate that the proposed approach improves inference efficiency while maintaining or improving task success rates, validating the effectiveness of unified, cross-modal token reuse in VLA models.
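Below is a minimal sketch of how the two-stage unified mask described in the abstract could be realized. The function name, tensor layouts, and threshold values are illustrative assumptions for exposition, not the paper's actual implementation; the abstract does not specify these details.

```python
import torch
import torch.nn.functional as F

def unified_reuse_mask(prev_patches: torch.Tensor,
                       cur_patches: torch.Tensor,
                       text_to_patch_attn: torch.Tensor,
                       sim_thresh: float = 0.95,
                       attn_thresh: float = 0.01) -> torch.Tensor:
    """Two-stage unified mask over visual patches (hypothetical sketch).

    A patch is marked reusable if it is (1) temporally stable across
    consecutive frames AND (2) task-irrelevant per cross-modal attention.
    Thresholds are illustrative placeholders, not reported values.

    prev_patches, cur_patches: [num_patches, dim] patch embeddings
    text_to_patch_attn: [num_text_tokens, num_patches] attention weights
    returns: [num_patches] boolean mask; True => reuse cached features
    """
    # Stage 1: temporal appearance consistency, measured here as cosine
    # similarity between corresponding patch embeddings of two frames.
    sim = F.cosine_similarity(prev_patches, cur_patches, dim=-1)
    static = sim > sim_thresh

    # Stage 2: attention-based relevance; patches that the instruction
    # tokens attend to strongly are treated as task-critical and kept
    # for recomputation.
    relevance = text_to_patch_attn.mean(dim=0)  # average over text tokens
    irrelevant = relevance < attn_thresh

    # Only patches that are both static and task-irrelevant are reused;
    # the same mask would drive reuse in the vision encoder and the LM.
    return static & irrelevant
```

One would then gather cached representations for masked patches and recompute only the remainder in both modules, which is what keeps the cross-modal token streams consistent; this wiring is likewise an assumption about the method's structure inferred from the abstract.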
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Efficient / Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8977