OCR-Aware Scene Graph Generation Via Multi-modal Object Representation Enhancement and Logical Bias Learning
Abstract: Scene Graph Generation (SGG) is the task of automatically mapping an image or video into a semantic, structured scene graph for better scene understanding; it requires detecting objects and identifying the relationships between them. Current SGG methods ignore an essential element of scene images: scene text. To better utilize this information for more comprehensive image understanding, we introduce it into the SGG task and propose an OCR-aware Scene Graph Generation (OSGG) baseline approach. To address the training bias present in both the SGG and OSGG tasks, we present a novel learning strategy based on causal inference that removes the harmful bias and makes the prediction process more rational. Object feature representations are key to these tasks, but they are generally extracted from bounding boxes and are therefore coarse. To obtain finer-grained object features, we propose a visual feature enhancement module that fuses the linguistic modality through cross-modal attention. For evaluation, we provide a new OCR-aware dataset, TextCaps-SG, to benchmark performance. Experimental results on this dataset and the Visual Genome (VG) dataset demonstrate the effectiveness of each designed module and verify the superiority of our proposed method over other state-of-the-art methods. Moreover, we apply the generated OCR-aware scene graphs to cross-modal retrieval tasks. Experiments conducted on COCO TextCaps (CTC) and TextCaps-SG further show that our method significantly outperforms previous SG-based retrieval methods and achieves results competitive with, or better than, some large-scale models.
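The abstract does not specify how the causal-inference debiasing is instantiated. A common realization of this idea in unbiased SGG is counterfactual Total Direct Effect (TDE) inference, where the prediction driven purely by dataset priors is subtracted from the factual prediction. The sketch below is an assumption along those lines, not the authors' implementation; the mean-feature counterfactual, the model, and all names are hypothetical.

```python
import torch
import torch.nn as nn

def counterfactual_debias(model: nn.Module,
                          pair_feats: torch.Tensor) -> torch.Tensor:
    """TDE-style debiasing sketch (assumed, not the paper's exact method).

    Subtracts the logits produced by a counterfactual input, in which the
    content-specific signal is wiped out, so only the context-independent
    dataset bias drives that branch. What remains approximates the causal
    effect of the actual pair features.
    """
    # Factual prediction from the real subject-object pair features.
    y = model(pair_feats)
    # Counterfactual: replace every feature with the batch mean, so the
    # prediction reflects only the learned prior ("bad bias").
    blank = pair_feats.mean(dim=0, keepdim=True).expand_as(pair_feats)
    y_cf = model(blank)
    # Total Direct Effect: keep the content-specific part of the logits.
    return y - y_cf

# Usage with a toy predicate classifier (dimensions are assumptions):
clf = nn.Linear(512, 50)                  # 50 predicate classes
feats = torch.randn(16, 512)              # 16 pair features
debiased_logits = counterfactual_debias(clf, feats)
```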
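To make the feature enhancement idea concrete, here is a minimal sketch of fusing box-level visual features with linguistic features (e.g., embedded OCR tokens or class labels) via cross-modal attention. All dimensions, names, and the residual-fusion design are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossModalEnhancer(nn.Module):
    """Hypothetical cross-modal attention module: coarse visual features
    (queries) attend to linguistic tokens (keys/values) to pick up
    fine-grained, text-aware detail."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor,
                linguistic: torch.Tensor) -> torch.Tensor:
        # visual:     (B, N_obj, dim)  RoI/bounding-box features
        # linguistic: (B, N_tok, dim)  embedded OCR tokens or labels
        attended, _ = self.attn(query=visual, key=linguistic, value=linguistic)
        # Residual fusion preserves the original visual signal.
        return self.norm(visual + attended)

# Usage with dummy tensors:
enhancer = CrossModalEnhancer()
v = torch.randn(2, 10, 512)   # 10 detected objects per image
t = torch.randn(2, 20, 512)   # 20 OCR-token embeddings per image
enhanced = enhancer(v, t)     # -> (2, 10, 512)
```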