Hierarchical Perceptual and Predictive Analogy-Inference Network for Abstract Visual Reasoning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | CC BY 4.0
Abstract: Advances in computer vision research enable human-like high-dimensional perceptual induction over analogical visual reasoning problems, such as Raven's Progressive Matrices (RPMs). In this paper, we propose a Hierarchical Perception and Predictive Analogy-Inference network (HP$^2$AI), consisting of three major components that tackle key challenges of RPM problems. First, because the shallow networks in most existing RPM solvers have limited receptive fields, we propose a perceptual encoder built from a series of hierarchically coupled Patch Attention and Local Context (PALC) blocks, which capture local attributes at early stages and the global panel layout at deep stages. Second, most methods seek object-level similarities to map the context images directly to the answer image, and so fail to extract the underlying analogies. The proposed reasoning module, Predictive Analogy-Inference (PredAI), consists of a set of Analogy-Inference Blocks (AIBs) that model and exploit the inherent analogical reasoning rules instead of object similarity. Lastly, the Squeeze-and-Excitation Channel-wise Attention (SECA) in the proposed PredAI discriminates essential attributes and analogies from irrelevant ones. Extensive experiments on four benchmark RPM datasets show that the proposed HP$^2$AI consistently achieves significant performance gains over state-of-the-art methods.
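The abstract names SECA but does not describe its internals. As a rough illustration only, the following is a minimal sketch of generic squeeze-and-excitation channel gating, which the module's name suggests; it is not the authors' implementation. The function name `squeeze_excite`, the bias-free weight matrices `w1` and `w2`, and the nested-list tensor layout are all assumptions made for this example.

```python
import math

def squeeze_excite(feature_map, w1, w2):
    """Generic squeeze-and-excitation gating (illustrative, not the paper's SECA).

    feature_map: list of C channels, each an H x W list of lists of floats.
    w1: hypothetical (C/r) x C bottleneck weights; w2: hypothetical C x (C/r) weights.
    Returns (gates, scaled_feature_map).
    """
    # Squeeze: global average pooling reduces each channel to one scalar.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]
    # Excitation, step 1: bottleneck linear layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: linear layer followed by a sigmoid, giving gates in (0, 1).
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature_map)]
    return gates, scaled

# Two 2x2 channels: one active, one silent; weights chosen by hand for the demo.
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
gates, scaled = squeeze_excite(fmap, w1=[[1.0, -1.0]], w2=[[2.0], [-2.0]])
```

With these hand-picked weights the active channel receives a gate above 0.5 and the silent channel a gate below 0.5, which is the sense in which channel attention can emphasize some attributes over others. In a trained network the weights would be learned rather than fixed.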
Relevance To Conference: This work targets Raven's Progressive Matrices, a typical format of Intelligence Quotient (IQ) tests. It is a variant of the Visual Question Answering (VQA) task in which the questions, answers and cues are all images. A tester (human or system) needs to induce a common controlling rule from the first two rows of images and, based on the first two images given in the third row, select the correct answer from the option set so that the rules underlying all three rows are consistent. Specifically, the proposed Hierarchical Patch Attention and Local Context module perceives the images at multiple receptive fields, the proposed Predictive Analogy-Inference reasoning module uncovers the underlying rules to derive the correct answer, and the proposed Squeeze-and-Excitation Channel-wise Attention discriminates essential attributes and analogies from irrelevant ones. The proposed method consistently outperforms state-of-the-art models on four benchmark datasets. It can be applied not only to human intelligence assessment, but also to multimedia content such as CAPTCHAs and online oral quizzes. This work significantly advances research in visual abstract reasoning.
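The task format described above can be made concrete with a toy example. The sketch below is not the proposed network: it assumes, purely for illustration, that each panel has already been reduced to a single integer attribute (for instance, a shape count) and that the only possible rule is a constant-step progression along each row. The names `row_rule` and `solve_toy_rpm` are hypothetical.

```python
def row_rule(row):
    """Return the constant step of a three-panel row, or None if it is not a progression."""
    a, b, c = row
    return b - a if (b - a) == (c - b) else None

def solve_toy_rpm(context, candidates):
    """Pick the candidate that makes row three follow the rule of rows one and two.

    context: two complete rows plus the first two entries of the third row.
    candidates: possible values for the missing ninth panel.
    Returns the index of the first consistent candidate, or None if none fits.
    """
    # Induce the shared rule from the two complete rows.
    rules = {row_rule(context[0]), row_rule(context[1])}
    if len(rules) != 1 or None in rules:
        raise ValueError("context rows do not share one progression rule")
    step = rules.pop()
    # Test each option by completing the third row and checking rule consistency.
    for index, candidate in enumerate(candidates):
        if row_rule(context[2] + [candidate]) == step:
            return index
    return None

context = [[1, 2, 3], [2, 3, 4], [3, 4]]
candidates = [3, 6, 5, 4, 7, 1, 2, 8]
answer_index = solve_toy_rpm(context, candidates)
```

Here the rule is known in advance and the attributes are given. An actual RPM solver has to infer both from raw panel images, which is the part the perception and reasoning modules address.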
Supplementary Material: zip
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Experience] Multimedia Applications, [Content] Vision and Language, [Engagement] Summarization, Analytics, and Storytelling
Submission Number: 408