Abstract: Spatial perception is a crucial component of intelligence and plays a vital role in understanding the physical world. Current multimodal large language models (MLLMs) exhibit promising spatial perception abilities. However, existing datasets are limited to absolute spatial perspectives or are small-scale, unstructured collections that lack systematic variation in spatial relationships, which hinders models from understanding how rotation affects spatial relations. To address this issue, we propose ROTATE, a novel pipeline for synthesizing spatial relation datasets, and use it to create the ROTATE dataset with 48K synthetic images, 608K captions, and 250K QA pairs covering both relative and absolute spatial perspectives. To further strengthen models' understanding of rotation, we introduce a novel task, Spatial Difference Generation, in which the model must identify and generate both the commonalities and the differences in spatial relationships between paired images. Experimental results show that, with three-stage training, the ROTATE dataset significantly improves a model's ability to comprehend spatial relationships from both relative and absolute perspectives. Furthermore, incorporating the Spatial Difference Generation task during training yields additional gains in rotation comprehension and increases response consistency. The dataset and code will be released upon publication.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodality
Contribution Types: Data resources
Languages Studied: English
Submission Number: 1816