MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering

ACL ARR 2024 April Submission 162 Authors

14 Apr 2024 (modified: 15 May 2024) · CC BY 4.0
Abstract: Multimodal large language models (MLLMs) may struggle with visual-based (personal) entity question answering (VEQA), such as "who is A?" or "who is A that B is talking to?", for various reasons, e.g., the absence of A's name in the caption or the inability of MLLMs to recognize A, particularly for less common entities. Furthermore, even if an MLLM can identify A, it may refrain from answering due to privacy concerns. In this paper, we introduce a novel methodology called Matching-Augmented Reasoning (MAR) to enhance VEQA. Given a collection of visual objects with captions, MAR preprocesses each object individually, identifying faces, names, and their alignments within the object. It encodes this information and stores the resulting vector representations in vector databases. When handling a VEQA query, MAR retrieves matching faces and names and organizes these entities into a matching graph, where nodes represent entities and edges indicate their similarities. MAR then derives the answer to the query by reasoning over this matching graph. Extensive experiments show that MAR significantly improves VEQA compared with state-of-the-art MLLM-based methods.
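To make the matching-graph step concrete, here is a minimal, illustrative sketch, not the authors' implementation: it assumes cosine similarity for face retrieval, precomputed face-name alignment scores from caption preprocessing, and a strongest-multiplicative-path rule for the reasoning step. All identifiers (face_db, alignments, tau, k) are hypothetical.

```python
# Illustrative sketch of matching-graph reasoning for a VEQA query.
# Assumptions (not from the paper): cosine similarity for retrieval,
# alignment scores in (0, 1], strongest-path scoring via Dijkstra.
import numpy as np
import networkx as nx

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def build_matching_graph(query_vec, face_db, alignments, k=3, tau=0.5):
    """Nodes are face/name entities; edge weights are similarities."""
    g = nx.Graph()
    # Retrieval: link the query face to its k most similar stored faces.
    for fid in sorted(face_db, key=lambda f: cosine(query_vec, face_db[f]),
                      reverse=True)[:k]:
        s = cosine(query_vec, face_db[fid])
        if s >= tau:
            g.add_edge("query", fid, weight=s)
    # Alignment: link each retrieved face to the name it was paired with
    # during per-object preprocessing of captions.
    for fid, (name, score) in alignments.items():
        if fid in g:
            g.add_edge(fid, name, weight=score)
    return g

def answer(g, candidate_names):
    """Score each name by its strongest multiplicative path from the
    query node (Dijkstra on -log(weight)) and return the best match."""
    best, best_score = None, 0.0
    for name in candidate_names:
        if "query" in g and name in g and nx.has_path(g, "query", name):
            d = nx.dijkstra_path_length(
                g, "query", name,
                weight=lambda u, v, e: -np.log(e["weight"]))
            score = float(np.exp(-d))
            if score > best_score:
                best, best_score = name, score
    return best, best_score

# Toy usage: two stored faces, aligned with "Alice" and "Bob" respectively;
# the query face is a near-duplicate of the first stored face.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=8), rng.normal(size=8)
query = f1 + 0.05 * rng.normal(size=8)
graph = build_matching_graph(
    query, {"face1": f1, "face2": f2},
    {"face1": ("Alice", 0.9), "face2": ("Bob", 0.8)})
print(answer(graph, ["Alice", "Bob"]))  # expected: ("Alice", <score>)
```

The -log transform turns the strongest-product-of-weights path into a shortest-path problem, so off-the-shelf Dijkstra can serve as the graph-reasoning step in this toy setting.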
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond; Question Answering
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 162