Keywords: Vision Language Model, Embodied AI, Computer Vision
Abstract: A common method for creating Vision-Language-Action (VLA) models involves fine-tuning pre-trained Vision-Language Models (VLMs) for robotic control. However, this adaptation process often leads to \textbf{catastrophic forgetting}, where the VLM's original powerful reasoning capabilities are degraded. We identify that this issue stems from a fundamental task conflict: fine-tuning on dense, continuous action trajectories is misaligned with the VLM's pre-training objectives. To tackle this, we propose the \textbf{Narrowing of Trajectory VLA (NoTVLA)} framework, which mitigates catastrophic forgetting by reframing the action generation task. Instead of dense trajectories, NoTVLA learns to predict sparse, semantically meaningful 3D trajectory points leading to keyframes.
This approach aligns the fine-tuning task more closely with the VLM's inherent strengths, preserving its reasoning abilities. A key innovation of NoTVLA lies in its trajectory planning strategy, which applies temporal compression and spatial pruning to the robot end-effector's path. In multi-task evaluations, NoTVLA achieves superior performance and generalization compared to baselines such as $\pi_0$, while using over an order of magnitude less compute and without requiring a wrist-mounted camera.
This design ensures that NoTVLA’s operational accuracy closely approximates that of single-task expert models. Crucially, by mitigating catastrophic forgetting, it preserves the model’s inherent language capabilities, enabling \textbf{zero-shot generalization} in specific scenarios, supporting unified model deployment \textbf{across multiple robot platforms}, and fostering generalization even when \textbf{perceiving tasks from novel perspectives}.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 105