Keywords: Image Representations, Vision-Language, Spatial Reasoning, Computer Vision
Abstract: Despite the remarkable performance of large-scale pre-trained image representation models (i.e., vision encoders) across a wide range of vision tasks, they often fail to learn spatial relationships within images, constraining their effectiveness in downstream tasks such as visual spatial reasoning and vision-based robot control. This limitation stems from the scarcity of 3D or multi-view images, which makes it challenging to inject 3D spatial knowledge into the encoders. To overcome this limitation, we propose a novel learning framework that enhances spatial awareness in existing pre-trained image representation models. The core idea is to convert 3D spatial information into linguistic expressions, which are then used to inject spatial knowledge into vision encoders through a Large Vision Language Model (LVLM). To further improve spatial awareness, we introduce a multi-turn visual spatial reasoning approach; specifically, we adopt a Chain-of-Thought (CoT) framework that builds hierarchical spatial understanding through 10 sequential reasoning turns. The proposed approach enhances pre-trained vision encoders: for example, simply replacing the vision encoder in LLaVA-1.5-7B improves average accuracy on the SpatialRGPT visual spatial reasoning benchmark from 13.3% to 52.0%.
Submission Type: Short Research Paper (< 4 Pages)
Submission Number: 79