Abstract: Scene Graph Generation (SGG) is an important cross-modal task in scene understanding, aiming to detect visual relations in an image. However, because appearance features vary widely, the feature distributions of different predicate categories overlap severely, making the decision boundaries ambiguous. Current SGG methods mainly attempt to re-balance the data distribution, which is dataset-dependent and limits generalization. To address this problem, a Synergetic Prototype Learning Network (SPLN) is proposed, in which the generalized semantic space is modeled and the synergetic effect among different semantic subspaces is exploited.
In SPLN, a Collaboration-induced Prototype Learning method is proposed to model the interaction between visual semantics and structural semantics. For conventional visual semantics, a residual-driven representation enhancement module is introduced to capture fine-grained details. The intersection of structural semantics and visual semantics is explicitly modeled as conceptual semantics, which existing methods have ignored. Meanwhile, to alleviate the noise of unrelated and meaningless words, an Intersection-induced Prototype Learning method with an essence-driven prototype enhancement module is proposed specifically for conceptual semantics. Moreover, a Selective Fusion Module is proposed to synergetically integrate the results of the visual, structural, and conceptual branches with the generalized semantics projection. Experiments on the Visual Genome (VG) and GQA datasets show that our method achieves state-of-the-art performance on unbiased metrics, and ablation studies validate the effectiveness of each component.
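To make the overall design concrete, below is a minimal, illustrative sketch (not the authors' implementation) of two ideas the abstract describes: prototype-based relation classification per semantic branch, and a gated selective fusion over the visual, structural, and conceptual branch outputs. All module names, dimensions, and the gating scheme are assumptions made for illustration; the paper's actual SPLN architecture may differ.

```python
# Hypothetical sketch: prototype branches + selective fusion (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeBranch(nn.Module):
    """One semantic branch: score relation features against learnable class prototypes."""
    def __init__(self, feat_dim: int, num_predicates: int):
        super().__init__()
        # One learnable prototype per predicate class in this semantic subspace.
        self.prototypes = nn.Parameter(torch.randn(num_predicates, feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between features and prototypes -> per-class logits.
        x = F.normalize(x, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        return x @ protos.t()

class SelectiveFusion(nn.Module):
    """Hypothetical gated fusion of visual, structural, and conceptual branch logits."""
    def __init__(self, feat_dim: int, num_branches: int = 3):
        super().__init__()
        # Predict a softmax weight per branch from the shared relation feature.
        self.gate = nn.Linear(feat_dim, num_branches)

    def forward(self, feat: torch.Tensor, branch_logits: list) -> torch.Tensor:
        weights = F.softmax(self.gate(feat), dim=-1)         # (B, num_branches)
        stacked = torch.stack(branch_logits, dim=1)          # (B, num_branches, C)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (B, C)

# Toy usage: 4 relation features, 64-dim, 50 predicate classes (VG-scale).
feat_dim, num_preds = 64, 50
branches = nn.ModuleList(PrototypeBranch(feat_dim, num_preds) for _ in range(3))
fusion = SelectiveFusion(feat_dim)
feats = torch.randn(4, feat_dim)
logits = fusion(feats, [b(feats) for b in branches])
print(logits.shape)  # torch.Size([4, 50])
```

The cosine-similarity scoring reflects the common choice in prototype learning of comparing normalized features against class prototypes; the learned per-branch gate stands in for the paper's Selective Fusion Module, whose exact formulation is not given in the abstract.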
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: Scene Graph Generation (SGG) is an important cross-modal task in scene understanding, aiming to detect visual relations in an image. Here, a Synergetic Prototype Learning Network (SPLN) is proposed to deal with ambiguous decision boundaries by decomposing the generalized semantic space of SGG. In SPLN, a Collaboration-induced Prototype Learning method is proposed to model the interaction between visual semantics and structural semantics. For conventional visual semantics, a residual-driven representation enhancement module is introduced to capture fine-grained details. The intersection of structural semantics and visual semantics is explicitly modeled as conceptual semantics, which existing methods have ignored. Meanwhile, to alleviate the noise of unrelated and meaningless words, an Intersection-induced Prototype Learning method with an essence-driven prototype enhancement module is proposed specifically for conceptual semantics. Moreover, a Selective Fusion Module is proposed to model the distribution based on the three semantic branches and the generalized semantics projection. Our SPLN achieves state-of-the-art performance on the Visual Genome and GQA datasets, which demonstrates the effectiveness of the synergetic effect among different semantic subspaces and suggests a new way to approach multimodal tasks.
Submission Number: 2032