Keywords: Scene Graph Embeddings, Task Agnostic Representations, Vision Alignment
Abstract: Scene Graphs (SGs) provide richer information, stronger structure, and greater interpretability than raw images. However, existing SG embedding models
are typically trained in a task-specific manner, limiting their ability to generalize across downstream applications. To address this, we introduce Task-Agnostic
Embedding of Scene Graphs using Vision Alignment (TESA). TESA employs a
Graph Neural Network trained with a cosine similarity loss to align ground-truth
SG embeddings with image embeddings produced by a frozen, pretrained foundation model. This design leverages the generalization properties of foundation
models and transfers them into the SG domain. We evaluate embedding quality on three datasets (VG-150, PSG, and GQA) and assess generalization on four
downstream tasks: Image Retrieval, Visual Question Answering, Scene Graph
Generation, and Image Classification. Our experiments show that replacing visual embeddings generated by foundation models with TESA embeddings yields
comparable or even improved performance. These results demonstrate that TESA
produces high-quality, task-agnostic SG embeddings that retain the structural and
interpretability advantages of scene graphs, while achieving effectiveness on par
with image-based representations.
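To make the alignment objective concrete, the following is a minimal sketch of the cosine-similarity loss described above, assuming a PyTorch setup; the names tesa_gnn and frozen_encoder are hypothetical placeholders, not identifiers from the paper:

```python
import torch
import torch.nn.functional as F

def alignment_loss(sg_embeddings: torch.Tensor,
                   image_embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity alignment loss: pull each scene-graph embedding
    toward the frozen foundation-model embedding of its paired image."""
    # Normalize both embedding batches to unit length.
    sg = F.normalize(sg_embeddings, dim=-1)
    img = F.normalize(image_embeddings, dim=-1)
    # Maximizing cosine similarity is equivalent to minimizing (1 - cos).
    return (1.0 - (sg * img).sum(dim=-1)).mean()

# Training-step sketch (tesa_gnn, frozen_encoder, and the batch variables
# are assumed placeholders for illustration only):
# sg_emb = tesa_gnn(scene_graph_batch)          # trainable GNN encoder
# with torch.no_grad():
#     img_emb = frozen_encoder(image_batch)     # frozen foundation model
# loss = alignment_loss(sg_emb, img_emb)
```

Only the SG encoder receives gradients in this setup; keeping the foundation model frozen is what lets its generalization properties transfer into the SG domain.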
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7443