Abstract: Visual reasoning requires models to construct a reasoning process that leads to the final decision. Previous studies have used attention maps or textual explanations to illustrate this process, but both have limitations: attention maps can be difficult to read, textual explanations cannot fully describe the reasoning process, and both are hard to evaluate quantitatively. This paper proposes a novel pixel-to-explanation reasoning model that employs a user-friendly multimodal rationale to depict the reasoning process. The model dissects the question into subquestions and constructs reasoning cells that retrieve knowledge from the image and question based on these subquestions. The intermediate outcomes of the reasoning cells are translated into object bounding boxes and classes, while the final output is classified as a standard VQA answer and translated into a complete answer that summarizes the entire reasoning process. All the generated results can be combined into a human-readable and informative explanation that can be evaluated quantitatively. Beyond interpretability, we achieve a 4.4% improvement over our baseline model on the GQA dataset and attain new state-of-the-art results on the challenging GQA-OOD dataset.