Precision in Knowledge Empowers, Excess is Distraction: Visual Question Answering with Knowledge-Infused Language Models

Anonymous

16 Feb 2024 · ACL ARR 2024 February Blind Submission
Abstract: In the realm of multimodal tasks, Visual Question Answering (VQA) plays a crucial role by answering natural-language questions grounded in visual content. Knowledge-Based Visual Question Answering (KBVQA) extends this concept by integrating external knowledge with images to answer questions. KBVQA shows great potential for real-world applications, including assistance for the visually impaired and improved image search. We introduce a novel approach to KBVQA that augments the existing vision-language transformer encoder-decoder model OFA. Our main contribution is enriching questions with pertinent external knowledge extracted from knowledge graphs via a dynamic triple extraction method: we supply a variable number of knowledge-graph triples as context, tailored to what each question requires. Our knowledge-enriched model achieves an average improvement of 4.75% in Exact Match score over the state of the art (SOTA) on three different KBVQA datasets. Through extensive experiments and analysis, we show that supplying a variable number of triples per question improves the reasoning capabilities of the language model compared with supplying a fixed number. We also demonstrate the model's generalization capability through SOTA-beating performance on a small dataset, achieved with straightforward fine-tuning.
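To make the abstract's "dynamic triple extraction" idea concrete, here is a minimal sketch of how a variable, question-dependent number of knowledge-graph triples might be selected and prepended to a question before it is fed to an encoder-decoder model such as OFA. The function names, the token-overlap scoring heuristic, and the threshold are hypothetical illustrations for this page, not the authors' implementation.

```python
# Sketch: select a variable number of KG triples per question (the abstract's
# "dynamic triple extraction"), then serialize them as context for the model.
# Scoring function and threshold are hypothetical, not from the paper.

from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def score_triple(question: str, triple: Triple) -> float:
    """Toy relevance score: fraction of the triple's tokens that also appear
    in the question. A real system would use a learned retriever."""
    q_tokens = set(question.lower().split())
    t_tokens = set(" ".join(triple).lower().split())
    return len(q_tokens & t_tokens) / max(len(t_tokens), 1)

def select_dynamic_triples(question: str, candidates: List[Triple],
                           threshold: float = 0.2,
                           max_triples: int = 10) -> List[Triple]:
    """Keep only triples whose relevance clears the threshold, so the number
    of triples varies per question instead of being fixed."""
    ranked = sorted(candidates,
                    key=lambda t: score_triple(question, t), reverse=True)
    return [t for t in ranked
            if score_triple(question, t) >= threshold][:max_triples]

def build_model_input(question: str, triples: List[Triple]) -> str:
    """Serialize the selected triples and append the question, yielding the
    knowledge-enriched text passed to the encoder-decoder alongside the image."""
    context = " ".join(f"<{s}, {r}, {o}>" for s, r, o in triples)
    return f"context: {context} question: {question}"

if __name__ == "__main__":
    kg = [("Eiffel Tower", "located_in", "Paris"),
          ("Paris", "capital_of", "France"),
          ("Louvre", "located_in", "Paris")]
    q = "In which country is the tower in the image located?"
    print(build_model_input(q, select_dynamic_triples(q, kg)))
```

Under this sketch, an irrelevant triple simply falls below the threshold and is dropped, which mirrors the abstract's claim that excess knowledge distracts while precisely targeted triples help.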
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models
Languages Studied: English