Abstract: Spatial perception is a crucial component of intelligence and plays a vital role in understanding the physical world. Current multimodal large language models (MLLMs) exhibit promising spatial perception abilities. However, existing datasets are limited to absolute spatial perspectives or are small-scale, unstructured collections that lack systematic variation in spatial relationships, which hinders models from understanding how rotation affects spatial relations. To address this issue, we propose ROTATE, a novel pipeline for synthesizing spatial relation datasets, and use it to create the ROTATE dataset with 48K synthetic images, 608K captions, and 250K QA pairs covering both relative and absolute spatial perspectives. To further strengthen models' understanding of rotation, we introduce a novel task, Spatial Difference Generation, in which the model must identify and generate both the commonalities and the differences in spatial relationships between paired images. Experimental results show that, with three-stage training, the ROTATE dataset significantly improves a model's ability to comprehend spatial relationships from both relative and absolute perspectives. Furthermore, incorporating the Spatial Difference Generation task during training yields additional gains in rotation comprehension and increases response consistency. The dataset and code will be released upon publication.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality
Contribution Types: Data resources
Languages Studied: English
Submission Number: 1816