Abstract: Most existing visual reasoning networks focus on aligning language with specific image regions, but they often struggle to comprehend the complex spatial and logical relationships in real-world scenes. These networks also lack interpretability. To address these problems, this paper proposes the Bilinear-MAC network, which has stronger reasoning ability. The MAC network is a single-stream reasoning network: it uses only the question representation as the control premise for extracting image information. As a result, it can understand simple images but not complex images of real scenes. The Bilinear-MAC network proposed here is a dual-stream reasoning network: it uses the question representation as the control premise for extracting image information, and the image representation as the control premise for extracting question information. It therefore understands real-scene images with complex relationships and rich object types more effectively. The proposed network achieves an overall accuracy of 59.6% on the GQA dataset, which is 5.6% higher than that of the MAC network, and an accuracy of 99.1% on the CLEVR dataset. The experimental results demonstrate that the proposed network understands complex real-world visual scenes more effectively.
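The dual-stream idea described in the abstract can be illustrated with a minimal sketch. This is not the Bilinear-MAC implementation: the function names (`attend`, `dual_stream_step`), the mean pooling used to summarize each modality, and the plain dot-product attention are all assumptions chosen only to show the two directions of control, with the question guiding attention over image regions and the image guiding attention over question words.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(control, items):
    """Dot-product attention: weight each item by its similarity to `control`."""
    scores = [sum(c * x for c, x in zip(control, item)) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    summary = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return summary, weights


def mean_vector(items):
    """Average a list of equal-length vectors (assumed pooling, for illustration)."""
    n = len(items)
    return [sum(item[d] for item in items) / n for d in range(len(items[0]))]


def dual_stream_step(question_words, image_regions):
    """One hypothetical dual-stream step with control flowing in both directions."""
    q_repr = mean_vector(question_words)   # question representation
    v_repr = mean_vector(image_regions)    # image representation
    # Stream 1: the question controls what is read from the image.
    image_info, image_weights = attend(q_repr, image_regions)
    # Stream 2: the image controls what is read from the question.
    question_info, word_weights = attend(v_repr, question_words)
    return image_info, question_info, image_weights, word_weights


# Toy 2-dimensional features: three question words and two image regions.
question_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_regions = [[2.0, 0.0], [0.0, 0.5]]
image_info, question_info, image_weights, word_weights = dual_stream_step(
    question_words, image_regions
)
```

A single-stream network in the sense of the abstract would compute only the first `attend` call. The second call is what gives the image a say in which question words matter.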