Abstract: Multimodal topic detection is an important social media analysis task with a wide variety of real-world applications. However, jointly modeling multimodal data and inferring their topics is challenging due to the semantic gaps between different modalities. Our insights come from psychological findings pertaining to the hierarchical structure in humans' inherent perception of images and texts. In this paper, we propose a Multimodal Hierarchical Reasoning Network (MHRN) to perform multimodal inference for topic detection. Images and texts are represented in a hierarchical structure named the Multimodal Part-whole Aware Graph (MPAG). MHRN then performs reasoning for topic inference through three modules: a Bottom-Up Aggregation (BUA) module that encodes the hierarchical connections and sibling relations in MPAG, a Top-Down Guidance (TDG) module that enriches node features in MPAG under the guidance of their parents, and a Bottom-Up Cross Aggregation (BUCA) module that captures and aggregates cross-modality cues for effective multimodal reasoning. Extensive experiments are conducted on two benchmarks, and the results demonstrate the superiority of our approach.
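To make the roles of the three modules concrete, the following is a minimal sketch of one reasoning pass over a toy part-whole graph, not the authors' implementation: all module names, feature dimensions, update rules, and the toy graph structure are assumptions introduced purely for illustration.

```python
# Minimal sketch (not the paper's code): one reasoning pass over a toy
# part-whole graph, assuming each modality is a small tree of feature nodes.
import torch
import torch.nn as nn

D = 64  # hypothetical feature dimension

class BottomUpAggregation(nn.Module):
    """Aggregate child (part) features into their parent (whole) node."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2 * D, D)

    def forward(self, nodes, children):           # children: {parent: [child ids]}
        out = nodes.clone()
        for p, kids in children.items():
            pooled = nodes[kids].mean(dim=0)       # pool sibling features together
            out[p] = torch.relu(self.proj(torch.cat([nodes[p], pooled])))
        return out

class TopDownGuidance(nn.Module):
    """Enrich each child feature with gated guidance from its parent."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Linear(2 * D, D)

    def forward(self, nodes, parent_of):           # parent_of: {child: parent id}
        out = nodes.clone()
        for c, p in parent_of.items():
            g = torch.sigmoid(self.gate(torch.cat([nodes[c], nodes[p]])))
            out[c] = nodes[c] + g * nodes[p]
        return out

class CrossModalAggregation(nn.Module):
    """Attend from one modality's nodes to the other's and fuse the result."""
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    def forward(self, img_nodes, txt_nodes):
        fused, _ = self.attn(img_nodes.unsqueeze(0),
                             txt_nodes.unsqueeze(0),
                             txt_nodes.unsqueeze(0))
        return img_nodes + fused.squeeze(0)

# Toy graph: node 0 is the root (whole) with parts 1 and 2, in each modality.
img, txt = torch.randn(3, D), torch.randn(3, D)
children, parent_of = {0: [1, 2]}, {1: 0, 2: 0}

bua, tdg, buca = BottomUpAggregation(), TopDownGuidance(), CrossModalAggregation()
img = tdg(bua(img, children), parent_of)           # hierarchical reasoning per modality
txt = tdg(bua(txt, children), parent_of)
topic_repr = buca(img, txt).mean(dim=0)            # cross-modal fusion, then pooling
print(topic_repr.shape)                            # torch.Size([64])
```

The sketch only illustrates the ordering suggested by the abstract, i.e. bottom-up aggregation followed by top-down guidance within each modality, and cross-modal aggregation afterwards; the actual encoders, graph construction, and topic classifier are described in the body of the paper.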