Abstract: Highlights
• A novel textual representation for complex scenes based on location tokens.
• Location tokens allow Language Models to ground spatial relations between objects.
• Using an automatically generated synthetic dataset, we train Language Models for spatial grounding.
• The learned grounding mechanisms transfer to the Visual Spatial Reasoning dataset.
• An extensive analysis shows the importance of location tokens and spatial training.