Keywords: Scene Graph Embeddings, Task Agnostic Representations, Vision Alignment
Abstract: Scene Graphs (SGs) provide richer information, stronger structure, and greater interpretability than raw images. However, existing SG embedding models
are typically trained in a task-specific manner, limiting their ability to generalize across downstream applications. To address this, we introduce Task-Agnostic
Embedding of Scene Graphs using Vision Alignment (TESA). TESA employs a
Graph Neural Network trained with a cosine similarity loss to align ground-truth
SG embeddings with image embeddings produced by a frozen, pretrained foundation model. This design leverages the generalization properties of foundation
models and transfers them into the SG domain. We evaluate embedding quality on three datasets (VG-150, PSG, and GQA) and assess generalization on four
downstream tasks: Image Retrieval, Visual Question Answering, Scene Graph
Generation, and Image Classification. Our experiments show that replacing visual embeddings generated by foundation models with TESA embeddings yields
comparable or even improved performance. These results demonstrate that TESA
produces high-quality, task-agnostic SG embeddings that retain the structural and
interpretability advantages of scene graphs, while achieving effectiveness on par
with image-based representations.
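To make the alignment objective concrete, the following is a minimal sketch of the cosine-similarity loss described above, assuming a PyTorch setup; the names tesa_gnn and frozen_encoder are hypothetical placeholders, not identifiers from the paper:

```python
import torch
import torch.nn.functional as F

def alignment_loss(sg_embeddings: torch.Tensor,
                   image_embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity alignment loss: pull each scene-graph embedding
    toward the frozen foundation-model embedding of its paired image."""
    # Normalize both embedding batches to unit length.
    sg = F.normalize(sg_embeddings, dim=-1)
    img = F.normalize(image_embeddings, dim=-1)
    # Maximizing cosine similarity is equivalent to minimizing (1 - cos).
    return (1.0 - (sg * img).sum(dim=-1)).mean()

# Training-step sketch (tesa_gnn, frozen_encoder, and the batch variables
# are assumed placeholders for illustration only):
# sg_emb = tesa_gnn(scene_graph_batch)          # trainable GNN encoder
# with torch.no_grad():
#     img_emb = frozen_encoder(image_batch)     # frozen foundation model
# loss = alignment_loss(sg_emb, img_emb)
```

Only the SG encoder receives gradients in this setup; keeping the foundation model frozen is what lets its generalization properties transfer into the SG domain.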
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7443