Abstract: Scene Graph Generation (SGG) is a scene understanding task that aims to identify object entities and reason about their relationships within a given image. In contrast to prevailing two-stage methods built on a large object detector (e.g., Faster R-CNN), one-stage methods use a fixed-size set of learnable queries to jointly reason about relational triplets <subject, predicate, object>. This paradigm achieves robust performance with significantly fewer parameters and lower computational overhead. However, one-stage methods face the challenge of weak entanglement: entities involved in relationships require both coupled features shared within triplets and decoupled visual features. Previous methods either adopt a single decoder for coupled triplet feature modeling or multiple decoders for separate visual feature extraction, but fail to account for both. In this paper, we introduce UniQ, a Unified decoder with task-specific Queries architecture, in which task-specific queries generate decoupled visual features for subjects, objects, and predicates respectively, while a unified decoder enables coupled feature modeling within relational triplets. Experimental results on the Visual Genome dataset demonstrate that UniQ outperforms both one-stage and two-stage methods.
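The core idea described in the abstract, role-specific queries decoded by a single shared decoder, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all module names, query counts, and dimensions are assumptions, and the real model would add prediction heads, matching, and losses.

```python
import torch
import torch.nn as nn

class UniQSketch(nn.Module):
    """Hypothetical sketch: decoupled task-specific queries for subjects,
    objects, and predicates, all processed by ONE shared (unified) decoder
    so that self-attention couples the features within each triplet."""

    def __init__(self, num_queries=100, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        # Decoupled, role-specific learnable queries
        self.subj_q = nn.Embedding(num_queries, d_model)
        self.obj_q = nn.Embedding(num_queries, d_model)
        self.pred_q = nn.Embedding(num_queries, d_model)
        # A single decoder shared across all three roles (unified decoding)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, image_feats):
        # image_feats: (batch, num_tokens, d_model) from a backbone/encoder
        b = image_feats.size(0)
        # Concatenating role queries lets self-attention model coupled
        # triplet features while each role keeps its own query set.
        queries = torch.cat(
            [self.subj_q.weight, self.obj_q.weight, self.pred_q.weight], dim=0
        ).unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(queries, image_feats)
        n = self.subj_q.num_embeddings
        # Split decoded features back into subject / predicate / object roles
        return out[:, :n], out[:, n:2 * n], out[:, 2 * n:]
```

In this sketch, sharing one decoder keeps the triplet features coupled through self-attention, while the three separate query embeddings preserve role-specific (decoupled) visual features.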
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work focuses on scene graph generation (SGG), a crucial task in multimedia and multimodal processing. SGG aims to construct a semantic graph in which objects serve as nodes and the relationships between them are depicted by edges. Structured scene graphs provide a hierarchical representation of objects and their relationships within a scene, allowing for finer-grained analysis than traditional object detection or recognition methods. This higher level of understanding can enhance tasks such as image captioning, visual question answering, and content-based image retrieval. This work primarily investigates end-to-end scene graph generation based on transformer architectures. Research on end-to-end SGG models, which avoid dependence on manual design, fosters the seamless integration of SGG into large multimodal models, thereby augmenting their structured reasoning capabilities.
Submission Number: 4558