OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: A vision-language-action model that keeps its pre-trained VLM frozen, enabling strong zero-shot generalization.
Abstract: Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) because visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen, preserving the rich semantic understanding learned from large-scale pre-training and enabling strong zero-shot generalization. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
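
To make the text-aware visual feature extraction concrete, below is a minimal, illustrative JAX sketch of the core idea described in the abstract: language-token features from a frozen text encoder attend over patch features from a frozen vision encoder in their shared embedding space, and only the resulting instruction-aligned visual features are passed on to the policy transformer. This is a sketch under stated assumptions, not the OTTER implementation; function and variable names such as `text_aware_pool` are hypothetical and not taken from the otter_jax codebase.

```python
# Minimal sketch of text-aware visual feature extraction, assuming frozen
# CLIP-style encoders whose patch features and text-token features live in
# a shared embedding space. Names here are illustrative (hypothetical),
# not from the OTTER codebase.
import jax
import jax.numpy as jnp

def text_aware_pool(patch_feats, text_feats):
    """Attend from language tokens to visual patches.

    patch_feats: (num_patches, d)  patch features from a frozen vision encoder
    text_feats:  (num_tokens, d)   token features from a frozen text encoder
    Returns:     (num_tokens, d)   instruction-aligned visual features, one per
                                   language token, to feed the policy transformer
    """
    d = patch_feats.shape[-1]
    # Similarity between every language token and every image patch,
    # computed directly in the shared embedding space (no new parameters).
    logits = text_feats @ patch_feats.T / jnp.sqrt(d)   # (T, P)
    attn = jax.nn.softmax(logits, axis=-1)              # (T, P)
    # Each language token gathers only the visual evidence it aligns with.
    return attn @ patch_feats                           # (T, d)

# Toy usage with random placeholders standing in for frozen-encoder outputs.
key = jax.random.PRNGKey(0)
patches = jax.random.normal(key, (196, 512))   # e.g. 14x14 ViT patch grid
tokens = jax.random.normal(key, (8, 512))      # tokenized instruction
task_visual_feats = text_aware_pool(patches, tokens)
print(task_visual_feats.shape)                 # (8, 512)
```

In this form the readout introduces no trainable parameters; the pre-trained vision-language alignment alone decides which patches each instruction token attends to, which is what allows both encoders to remain frozen while only the downstream policy is trained.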
Lay Summary: Teaching robots to follow instructions like “pick up the red cup” is hard, especially with new objects or settings. Most methods retrain the robot’s vision and language models, which can weaken that understanding. We introduce OTTER, a new approach that helps robots follow instructions by working with perception models that already understand how images relate to language, without retraining them. OTTER focuses only on the parts of an image that are relevant to the instruction—like just the “red cup”—and passes that focused information to the robot’s decision-making system. Experiments show OTTER outperforms current methods, bringing us closer to robots that understand and act on instructions in many situations.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/FangchenLiu/otter_jax
Primary Area: Applications->Robotics
Keywords: vision language action model; robot foundation model
Submission Number: 5063