Coordinated Cross-Modal Token Reuse with a Unified Mask for Efficient and Accurate Inference in VLA Models
Keywords: Vision Language Models, Vision Language Action Models, Efficient Inference
Abstract: Vision-language-action (VLA) inference suffers from substantial redundancy across consecutive frames, where large regions of the visual input remain unchanged while the model repeatedly re-encodes the entire scene. We propose a training-free approach for efficient and accurate VLA inference based on \emph{coordinated cross-modal token reuse}. Our method introduces a \emph{unified mask} that identifies static and task-irrelevant visual patches using a two-stage criterion combining temporal appearance consistency and attention-based relevance. The unified mask drives reuse consistently in both the vision encoder and the language model: cached visual representations are reused for selected patches, while dynamic or task-critical regions are recomputed. This coordinated reuse preserves cross-modal consistency and enables end-to-end acceleration without modifying model architecture or requiring finetuning. Experiments on robotic manipulation tasks demonstrate that the proposed approach improves inference efficiency while maintaining or improving task success rates, validating the effectiveness of unified, cross-modal token reuse in VLA models.
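Below is a minimal sketch of how the two-stage unified mask described in the abstract could be realized. The function name, tensor layouts, and threshold values are illustrative assumptions for exposition, not the paper's actual implementation; the abstract does not specify these details.

```python
import torch
import torch.nn.functional as F

def unified_reuse_mask(prev_patches: torch.Tensor,
                       cur_patches: torch.Tensor,
                       text_to_patch_attn: torch.Tensor,
                       sim_thresh: float = 0.95,
                       attn_thresh: float = 0.01) -> torch.Tensor:
    """Two-stage unified mask over visual patches (hypothetical sketch).

    A patch is marked reusable if it is (1) temporally stable across
    consecutive frames AND (2) task-irrelevant per cross-modal attention.
    Thresholds are illustrative placeholders, not reported values.

    prev_patches, cur_patches: [num_patches, dim] patch embeddings
    text_to_patch_attn: [num_text_tokens, num_patches] attention weights
    returns: [num_patches] boolean mask; True => reuse cached features
    """
    # Stage 1: temporal appearance consistency, measured here as cosine
    # similarity between corresponding patch embeddings of two frames.
    sim = F.cosine_similarity(prev_patches, cur_patches, dim=-1)
    static = sim > sim_thresh

    # Stage 2: attention-based relevance; patches that the instruction
    # tokens attend to strongly are treated as task-critical and kept
    # for recomputation.
    relevance = text_to_patch_attn.mean(dim=0)  # average over text tokens
    irrelevant = relevance < attn_thresh

    # Only patches that are both static and task-irrelevant are reused;
    # the same mask would drive reuse in the vision encoder and the LM.
    return static & irrelevant
```

One would then gather cached representations for masked patches and recompute only the remainder in both modules, which is what keeps the cross-modal token streams consistent; this wiring is likewise an assumption about the method's structure inferred from the abstract.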
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Efficient / Low-Resource Methods for NLP
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 8977