VITA: Vision-To-Action Flow Matching Policy

Anonymous Authors
VITA is an efficient and performant policy that directly flows from latent images to latent actions
without sampling from Gaussian noise or relying on conditioning modules.
[Interactive animation: Camera Image → Latent Images → Flow Matching → Latent Actions → Action Sequence]

What is VITA?

We present VITA, a VIsion-To-Action flow matching policy that evolves latent visual representations into latent actions via flow matching for visuomotor control. Conventional flow matching and diffusion policies face a fundamental inefficiency: they sample from standard source distributions (e.g., Gaussian noise) and then require additional conditioning mechanisms, such as cross-attention, to repeatedly inject visual inputs at each generation step, incurring time and memory overheads. VITA instead treats latent images as the source of the flow and learns an inherent mapping from vision to action. Because the source of the flow is visually grounded, VITA eliminates the need for repeated conditioning during generation. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. Despite its simplicity, VITA outperforms or matches state-of-the-art policies while speeding up inference by 1.5x to 2x. VITA inherently enables simpler architectures such as MLPs. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bimanual manipulation tasks like those in the ALOHA benchmarks.

VITA Framework



VITA learns a continuous flow from latent visual representations to latent actions. Because the source of the flow is visually grounded, VITA eliminates the need for repeated conditioning during generation.
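The idea of using the latent image as the flow source can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation. It assumes a straight-line interpolation path between source and target, which is a common flow matching choice but is not stated on this page. The names `interpolate`, `target_velocity`, `flow_matching_loss`, and `velocity_fn` are hypothetical. Note that `velocity_fn` receives only the current latent and the time, with no separate conditioning input, because the visual information already sits in the starting point.

```python
def interpolate(z_img, z_act, t):
    """Point at time t on the straight path from the latent image to the latent action."""
    return [(1.0 - t) * s + t * a for s, a in zip(z_img, z_act)]


def target_velocity(z_img, z_act):
    """Velocity of the straight path (assumed here): target minus source."""
    return [a - s for s, a in zip(z_img, z_act)]


def flow_matching_loss(velocity_fn, z_img, z_act, t):
    """Mean squared error between the predicted and the target velocity at time t."""
    z_t = interpolate(z_img, z_act, t)
    pred = velocity_fn(z_t, t)  # no extra conditioning argument
    tgt = target_velocity(z_img, z_act)
    return sum((p - g) ** 2 for p, g in zip(pred, tgt)) / len(tgt)
```

In a conventional policy the first argument of `interpolate` would be a Gaussian sample and `velocity_fn` would need the image features as an additional input at every step.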

VITA Denoising

Comparison of the denoising process between conventional flow matching and VITA. Conventional flow matching denoises random Gaussian noise into actions, whereas VITA flows from latent images to latent actions. We found that, through VITA learning, latent images come to manifest action semantics: the latent image can be decoded into a smooth trajectory, which is then progressively refined by the ODE process.
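The progressive refinement described above can be pictured as a fixed-step ODE solve that decodes the latent after every step. The sketch below assumes a plain Euler solver and uses toy stand-ins: `refine`, `toward`, and `toy_decode` are hypothetical names, and the velocity field simply heads for a fixed target rather than being a learned network.

```python
def refine(velocity_fn, decode, z_img, num_steps=4):
    """Euler-integrate dz/dt = v(z, t) from t=0 to t=1, decoding after every step."""
    dt = 1.0 / num_steps
    z = list(z_img)
    decoded = [decode(z)]  # the starting latent image is decoded too
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
        decoded.append(decode(z))
    return decoded


def toward(target):
    """Toy velocity field that moves along the straight line to a fixed target."""
    def velocity_fn(z, t):
        # t never reaches 1 inside refine, so the division is safe.
        return [(g - zi) / (1.0 - t) for zi, g in zip(z, target)]
    return velocity_fn


def toy_decode(z):
    """Stand-in decoder (assumption): returns a copy of the latent."""
    return list(z)


steps = refine(toward([1.0, -1.0]), toy_decode, [0.0, 3.0], num_steps=4)
```

Here `steps[0]` corresponds to decoding the latent image directly, and each later entry is the decoded result after one more solver step.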



Addressing Latent Collapse of E2E Latent Flow Matching Training

Unlike latent diffusion for image generation, where the target latent space can be trained on abundant image data, action data is sparse and limited, so the target latent space is hard to pre-train well and freeze as the target for flow matching. Naively training flow matching end-to-end together with the target latent space leads to latent collapse (Figure (a)). We are the first to identify the cause as the gap between training and test time: training uses encoder-based latents, whereas inference uses ODE-generated latents. We propose flow latent decoding (FLD), which backpropagates through the flow ODE solver during training and closes the gap by anchoring latent representations using ground-truth targets.
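A scalar toy problem can illustrate what differentiating through the ODE solver means. This is a sketch under strong assumptions, not the authors' method: the velocity field is the hypothetical linear model v(z) = w*z + b, the decoder is a fixed scale factor, the solver is Euler, and the gradients are tracked by hand through every solver step. An autodiff framework would obtain the same gradients by backpropagation. All function names are illustrative.

```python
def solve_with_sensitivities(w, b, z_img, num_steps=4):
    """Euler-integrate dz/dt = w*z + b from t=0 to t=1, tracking dz/dw and dz/db."""
    dt = 1.0 / num_steps
    z, dz_dw, dz_db = z_img, 0.0, 0.0
    for _ in range(num_steps):
        # Chain rule through one Euler step z <- z + dt * (w*z + b);
        # the sensitivities use the value of z before the update.
        dz_dw = dz_dw * (1.0 + dt * w) + dt * z
        dz_db = dz_db * (1.0 + dt * w) + dt
        z = z + dt * (w * z + b)
    return z, dz_dw, dz_db


def fld_loss_and_grads(w, b, z_img, action, decode_scale=1.0):
    """Squared error between the decoded ODE-generated latent and the true action."""
    z1, dz_dw, dz_db = solve_with_sensitivities(w, b, z_img)
    err = decode_scale * z1 - action
    grad_w = 2.0 * err * decode_scale * dz_dw
    grad_b = 2.0 * err * decode_scale * dz_db
    return err ** 2, grad_w, grad_b


def train(z_img=0.5, action=2.0, lr=0.05, iters=200):
    """Gradient descent on the toy loss; returns the parameters and the loss history."""
    w, b, history = 0.0, 0.0, []
    for _ in range(iters):
        loss, grad_w, grad_b = fld_loss_and_grads(w, b, z_img, action)
        history.append(loss)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history
```

The loss is computed on the latent the solver actually produces at test time, not on an encoder output, which is the training-test gap the paragraph above describes.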



Efficiency



The table compares the inference latency and peak memory usage of different flow matching policies when using vector-based (Vector) or grid-based (Grid) representations for visual features. VITA achieves 1.5x to 2x faster inference and reduces memory usage by 18.6% to 28.7%.

Success Rates



We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA, AV-ALOHA, and Robomimic, covering challenging bimanual and single-arm manipulation. The MLP-only VITA consistently outperforms or matches state-of-the-art policies (including a transformer-based conventional flow matching policy), while being significantly more efficient.

VITA Demos: Real-World Tasks

VITA demonstrates robustness to online perturbations.

Online Perturbations

VITA demonstrates generalization to unseen objects.

Unseen Objects

Bimanual Tasks with Active Vision

Two challenging bimanual manipulation tasks on AV-ALOHA with an additional 7-DoF arm carrying an active vision camera. The robot must predict and reach the best viewpoint to avoid occlusions and increase precision.

Hidden Pick

Transfer From Box

VITA Demos: Real-World and Simulation Tasks

Hidden Pick

Transfer From Box

Pick Ball

Store Drawer

Thread Needle

Pour Test Tube

Hook Package

Slot Insertion

Transfer Cube

Square

PushT

Training Efficiency



VITA converges faster than other policies. We compare the action MSE curves of VITA, FM, DP, and ACT on three real-world tasks. VITA consistently converges faster and to lower errors. ACT plateaus early; DP and FM converge more slowly.