FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing

Published: 23 Sept 2025, Last Modified: 19 Nov 2025, Venue: SpaVLE Poster, License: CC BY 4.0
Keywords: Frame of Reference, Spatial Relation, Text to Image, Diffusion Model
Abstract: Frame of Reference (FoR) is a fundamental concept in spatial reasoning that humans use to comprehend and describe space. With the rapid progress in Vision and Language models, the moment has come to integrate this long-overlooked dimension into these models. For example, in text-to-image (T2I) generation, even state-of-the-art models exhibit a significant performance gap when spatial descriptions are given from perspectives other than the camera's. To address this limitation, we propose Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing (FoR-SALE), an extension of the Self-correcting LLM-controlled Diffusion (SLD) framework for T2I. Specifically, we use visual processing modules, including object detection, depth detection, and orientation detection, to extract the spatial cues needed to recognize the possible perspectives. We use LLMs to convert all spatial expressions into a unified camera perspective before interpreting the image layout. We then build on an image editing framework and introduce new latent operations that modify an object's facing direction and depth. We evaluate FoR-SALE on two benchmarks specifically designed to assess spatial understanding with FoR. Our framework improves the performance of state-of-the-art T2I models by up to 5.3% using only a single round of correction. Additionally, we provide a detailed analysis of the limitations of current T2I models from various perspectives, highlighting potential avenues for future research.
Supplementary Material: zip
Submission Type: Long Research Paper (< 9 Pages)
Submission Number: 21
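
Illustrative sketch (not from the paper): the abstract describes converting spatial expressions into a unified camera perspective. The minimal Python example below shows what such a conversion could look like for a single relation, assuming a simplified setup in which the reference object faces one of four directions on the ground plane. FoR-SALE itself uses an LLM for this step; the lookup here is a swapped-in stand-in. The names `FACING`, `CAMERA_LABEL`, and `to_camera_relation`, and the convention that "front" in the camera frame means "nearer to the camera", are all assumptions made for illustration.

```python
# Hypothetical sketch: map a relation stated in a reference object's own
# (intrinsic) frame to the equivalent relation in the camera's frame.
# Ground-plane axes (assumed): x points to image right, z points away
# from the camera.

# Assumed facing directions of the reference object, as seen by the camera.
FACING = {
    "away": (0, 1),     # object looks the same way the camera looks
    "toward": (0, -1),  # object looks at the camera
    "right": (1, 0),    # object looks toward image right
    "left": (-1, 0),    # object looks toward image left
}

# Assumed camera-frame vocabulary: "front" = nearer the camera,
# "back" = farther from the camera.
CAMERA_LABEL = {
    (1, 0): "right",
    (-1, 0): "left",
    (0, -1): "front",
    (0, 1): "back",
}


def to_camera_relation(relation: str, facing: str) -> str:
    """Convert an intrinsic relation ('front', 'back', 'left', 'right')
    into the camera-frame relation, given the object's facing direction."""
    fx, fz = FACING[facing]
    offsets = {
        "front": (fx, fz),    # along the facing direction
        "back": (-fx, -fz),   # opposite the facing direction
        "right": (fz, -fx),   # facing rotated 90 degrees clockwise (top view)
        "left": (-fz, fx),    # facing rotated 90 degrees counter-clockwise
    }
    return CAMERA_LABEL[offsets[relation]]


if __name__ == "__main__":
    # "to the left of the car, from the car's perspective", car facing the camera
    print(to_camera_relation("left", "toward"))
```

Under these assumptions, an object that faces the camera has its own left on the camera's right, which is the kind of mismatch a camera-perspective conversion has to resolve before the image layout is interpreted.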