FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 · MSLD 2026 Poster · CC BY 4.0
Keywords: Frame of Reference, Spatial Relation, Text to Image, Diffusion Model
Abstract: Spatial understanding is essential for enabling machines to achieve human-like performance on many tasks. One key aspect of this ability is the Frame of Reference (FoR), which indicates the perspective from which spatial relations are interpreted. While extensively studied in cognitive linguistics, FoRs have received limited attention in AI models, particularly in Text-to-Image (T2I) tasks. Current SOTA T2I models (e.g., GPT-4o, Stable Diffusion 2.1) exhibit a significant performance gap when spatial descriptions are provided from non-camera perspectives. To address this issue, we propose the **F**rame **o**f **R**eference-guided **S**patial **A**djustment in **L**LM-based Diffusion **E**diting (FoR-SALE) framework. Our approach builds upon the Self-correcting LLM-controlled Diffusion (SLD) pipeline, which uses LLMs to validate prompts and generate suggested layouts for editing images through latent-space operations. However, the original SLD framework does not account for FoR, limiting its ability to handle spatial prompts grounded in non-camera perspectives. FoR-SALE extends this paradigm by explicitly modeling FoR and enabling spatial adjustment across diverse perspectives. The FoR-SALE pipeline begins with standard T2I generation, where a context is passed to a T2I module to produce an initial image. It then employs vision modules to extract the image's spatial configuration while mapping the spatial expression to a corresponding camera perspective using the Layout Interpreter and FoR-Interpreter. This unified perspective enables direct evaluation of the alignment between language and vision. When misalignment is detected, the required editing operations are generated and applied. FoR-SALE applies novel latent-space operations to adjust the facing direction and depth of the generated images.
We demonstrate the effectiveness of FoR-SALE on two benchmarks: FoR-LMD, a modification of the LMD benchmark that includes perspective, and FoREST, a benchmark that includes textual input for various FoR cases. We observe that our technique improves images generated by SD-3.5-large, FLUX.1, and GPT-4o, SOTA models for T2I tasks, yielding up to a 5.30% improvement in a single correction round and 9.90% over three rounds. Moreover, we provide a thorough analysis that highlights the limitations of T2I models and of the LLMs used to suggest layouts from different perspectives. Using GPT-4o as the base generator, our method achieves SOTA performance on spatial expressions involving FoRs, particularly intrinsic FoRs, which are especially challenging. These results demonstrate the robustness of our proposed framework's reasoning over FoRs.
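The abstract's key step, mapping a spatial expression to the camera perspective before checking alignment, can be illustrated with a toy sketch. This is not the paper's implementation; all names (`to_camera_relation`, the angle convention, the four ground-plane relations) are hypothetical, and it only shows the core geometric idea: an intrinsic relation such as "to the car's left" is rotated by the relatum's facing direction into the camera frame, where it can be compared against the detected layout.

```python
import math

# Hypothetical camera-frame unit vectors on the ground plane:
# x points to the camera's right, z points away from the camera (depth).
CAMERA_DIRS = {
    "right":  (1.0, 0.0),
    "left":   (-1.0, 0.0),
    "front":  (0.0, -1.0),  # closer to the camera
    "behind": (0.0, 1.0),   # farther from the camera
}

def rotate(v, deg):
    """Rotate a 2-D vector counter-clockwise by `deg` degrees."""
    r = math.radians(deg)
    x, z = v
    return (x * math.cos(r) - z * math.sin(r),
            x * math.sin(r) + z * math.cos(r))

def to_camera_relation(relation, facing_deg):
    """Map an intrinsic relation, stated relative to an object whose facing
    direction is `facing_deg` (0 = facing the camera), to the closest
    camera-frame relation."""
    # Offsets of each intrinsic direction from the object's forward axis.
    offsets = {"front": 0, "left": 90, "behind": 180, "right": 270}
    forward = rotate(CAMERA_DIRS["front"], facing_deg)
    target = rotate(forward, offsets[relation])
    # Choose the camera-frame direction most aligned with the rotated vector.
    return max(CAMERA_DIRS,
               key=lambda k: CAMERA_DIRS[k][0] * target[0]
                           + CAMERA_DIRS[k][1] * target[1])
```

For example, for a car facing the camera, "to the car's left" resolves to the camera's *right* (a mirror flip), whereas for a car facing away it stays on the camera's left; once both the prompt and the detected layout are expressed in this shared camera frame, a mismatch directly signals that an editing operation is required.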
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 42