Keywords: Vision-Language-Action Models, Interpretability, Action Chunking Policies
TL;DR: MAQ composes each action chunk from multiple learned primitive queries instead of a single fixed query, improving robotic policy performance while revealing reusable, task-agnostic skills through interpretable router weights.
Abstract: Recent advances in large vision-language-action (VLA) models and chunk-based or parallel decoding have enabled robust and precise low-level robotic control from visual observations and language instructions. Yet most policies generate action chunks from a single fixed embedding or from placeholder tokens, which forces one query to account for all behaviors and limits adaptivity across environments and tasks. Motivated by neuroscience research on action selection, we model action chunks as mixtures of multiple primitive action queries, enabling adaptive and interpretable representations. We introduce the Mixture of Action Queries (MAQ), a lightweight, model-agnostic module for existing parallel decoders. MAQ composes each chunk from a small set of learned action queries that represent reusable skills. A learnable router conditions on the current vision and language context and assigns mixture weights to form the chunk. MAQ integrates into existing decoders without changes to the backbone or latent dimensionality. We validate MAQ by integrating it into ACT and recent state-of-the-art VLA models. Across real-world and simulated settings, MAQ improves performance and provides chunk-level interpretability that single-query designs do not offer. In multi-task training, the router assigns consistent queries to the same primitives across tasks, indicating reusable, task-agnostic skills and providing human-interpretable insight into policy behavior.
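The routing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear router, the softmax normalization, the tensor shapes, and every name below (`route`, `mix_queries`, `router_w`, `router_b`) are assumptions made for illustration only. The sketch shows how K learned queries could be blended into a single chunk query whose shape matches that of a single-query design, so the downstream decoder would not need to change.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(context, router_w, router_b):
    """Hypothetical linear router: one logit per action query.

    context:  list of d_ctx floats (pooled vision-language features, assumed)
    router_w: K rows of d_ctx floats
    router_b: K floats
    Returns K mixture weights that sum to 1.
    """
    logits = [
        sum(c * w for c, w in zip(context, row)) + b
        for row, b in zip(router_w, router_b)
    ]
    return softmax(logits)


def mix_queries(weights, queries):
    """Weighted sum of K queries, each of shape T x d, giving one T x d query."""
    steps, dim = len(queries[0]), len(queries[0][0])
    return [
        [sum(w * q[t][j] for w, q in zip(weights, queries)) for j in range(dim)]
        for t in range(steps)
    ]


# Toy sizes (assumed): K queries, chunk length T, latent dim d, context dim d_ctx.
random.seed(0)
K, T, d, d_ctx = 4, 8, 16, 32
queries = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(T)] for _ in range(K)]
router_w = [[random.gauss(0, 0.1) for _ in range(d_ctx)] for _ in range(K)]
router_b = [0.0] * K
context = [random.gauss(0, 1) for _ in range(d_ctx)]

weights = route(context, router_w, router_b)  # inspectable mixture weights
chunk_query = mix_queries(weights, queries)   # T x d, same shape as one query
```

In this sketch the `weights` vector is the quantity one would inspect for interpretability: if the same index dominates whenever a given primitive is executed, that would be consistent with the reusable-skill behavior the abstract reports.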
Submission Number: 41