Align R-CNN: A Pairwise Head Network for Visual Relationship Detection

2022 (modified: 24 Apr 2023), IEEE Trans. Multim. 2022
Abstract: Scene graphs connect individual objects with visual relationships and serve as a comprehensive scene representation for downstream multimodal tasks. However, by examining recent progress in Scene Graph Generation (SGG), we find that the performance of recent works is severely limited by pairwise relationship modeling based on naive feature concatenation. Such pairwise features lack sufficient object interaction because object parts are misaligned, resulting in non-discriminative pairwise features for visual relationship prediction. For example, naively concatenated pairwise features often cause the model to fail to discriminate between "riding" and "feeding" for the object pair "person" and "horse". To this end, we design a meta-architecture, learning-to-align, for dynamic object feature concatenation. We call our model Align R-CNN. Specifically, we introduce a novel attention-based multiple region alignment module that can be jointly optimized with SGG. Experiments on the large-scale SGG benchmark Visual Genome show that the proposed Align R-CNN can replace naive feature concatenation and thus boost all existing SGG methods.
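To illustrate the contrast the abstract draws, the sketch below compares naive pairwise feature concatenation against an attention-based alignment of subject and object part features. This is a minimal NumPy illustration of the general idea, not the paper's actual Align R-CNN module; all function names, dimensions, and the pooling choices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def naive_concat(subj, obj):
    # Baseline criticized in the abstract: pool each region's part
    # features independently, then concatenate. No part interaction.
    return np.concatenate([subj.mean(axis=0), obj.mean(axis=0)])

def aligned_concat(subj, obj, temperature=1.0):
    # Hypothetical attention-based alignment: each subject part attends
    # over object parts, so the concatenated halves correspond
    # part-by-part instead of being blindly pooled.
    scores = subj @ obj.T / temperature        # (n_subj, n_obj) similarities
    attn = softmax(scores, axis=-1)            # each row sums to 1
    aligned_obj = attn @ obj                   # object features re-weighted per subject part
    pairwise = np.concatenate([subj, aligned_obj], axis=-1)  # (n_subj, 2*d)
    return pairwise.mean(axis=0)               # pool parts to one pairwise vector

rng = np.random.default_rng(0)
subj = rng.normal(size=(4, 8))   # e.g. 4 parts of "person", 8-dim features
obj = rng.normal(size=(6, 8))    # e.g. 6 parts of "horse"
print(naive_concat(subj, obj).shape)    # both yield a 16-dim pairwise feature
print(aligned_concat(subj, obj).shape)
```

Both variants produce a pairwise vector of the same size, so an aligned module of this shape could drop in wherever naive concatenation is used, which is the plug-and-play property the abstract claims for Align R-CNN.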