Multimodal Dietary Knowledge Graph-Driven Visual Language Model for Food Question Answering

ACL ARR 2025 May Submission4336 Authors

19 May 2025 (modified: 03 Jul 2025)
License: CC BY 4.0
Abstract: Food analysis is crucial for personalized nutrition guidance and disease management. However, existing Visual Language Models (VLMs) have limited understanding of deep, multi-dimensional food knowledge such as nutritional composition, cultural background, and health impacts. Current food datasets and knowledge graphs often focus on textual knowledge, lacking visual information or failing to integrate cross-domain knowledge. To address these challenges, we construct DietKG-VQA, the first large-scale food analysis benchmark (3,404 images, 10,219 question-answer pairs) that fuses multi-domain (nutrition, culture, health) structured knowledge with visual information. We also propose a novel method for enhancing VLMs based on a Multimodal Dietary Knowledge Graph (MDKG): by building an MDKG that incorporates visual information and combining visual similarity retrieval, knowledge graph querying, and our proposed VLM-guided Knowledge Pruning & Selection (V-KPS) mechanism, we precisely extract the core knowledge needed to enhance VLM reasoning, especially for uncommon food items. Experimental results on the DietKG-VQA benchmark show that the proposed method significantly outperforms baseline VLMs; for example, the comprehensive average score of GPT-4o-mini increases from 34.81% to 76.02%. The DietKG-VQA benchmark and related code will be publicly released.
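The abstract describes a retrieval-augmented pipeline: visual similarity retrieval over the MDKG, knowledge graph querying, V-KPS pruning, and finally VLM answering. The sketch below is only an illustration of that flow under assumed interfaces; every function and class name (embed_image, FoodEntity, vlm_prune_and_select, etc.) is a hypothetical stand-in, not the authors' released code, and the V-KPS step is approximated here by a trivial keyword-overlap ranking rather than the paper's VLM-guided selection.

```python
# Illustrative sketch of the MDKG-augmented VQA flow described in the abstract.
# All names are hypothetical placeholders, not the authors' actual implementation.
from dataclasses import dataclass
import numpy as np


@dataclass
class FoodEntity:
    name: str
    image_embedding: np.ndarray  # visual embedding stored with the MDKG entity
    facts: list[str]             # structured nutrition / culture / health knowledge


def embed_image(image_path: str) -> np.ndarray:
    """Placeholder visual encoder; a real system would use a pretrained image encoder."""
    rng = np.random.default_rng(abs(hash(image_path)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)


def retrieve_similar_entities(query_emb: np.ndarray, entities: list[FoodEntity], top_k: int = 3):
    """Step 1: visual similarity retrieval over MDKG entities (cosine similarity)."""
    scored = [(float(query_emb @ e.image_embedding), e) for e in entities]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]


def query_knowledge_graph(entities: list[FoodEntity]) -> list[str]:
    """Step 2: gather candidate facts linked to the retrieved entities."""
    return [f"{e.name}: {fact}" for e in entities for fact in e.facts]


def vlm_prune_and_select(question: str, candidate_facts: list[str], keep: int = 5) -> list[str]:
    """Step 3: stand-in for V-KPS; here a simple keyword-overlap ranking selects facts."""
    q_tokens = set(question.lower().split())
    ranked = sorted(candidate_facts,
                    key=lambda f: len(q_tokens & set(f.lower().split())),
                    reverse=True)
    return ranked[:keep]


def build_prompt(image_path: str, question: str, kg_entities: list[FoodEntity]) -> str:
    """Step 4: assemble the knowledge-augmented prompt to send to the VLM with the image."""
    emb = embed_image(image_path)
    retrieved = retrieve_similar_entities(emb, kg_entities)
    facts = vlm_prune_and_select(question, query_knowledge_graph(retrieved))
    return f"Question: {question}\nRelevant knowledge:\n" + "\n".join(f"- {f}" for f in facts)


if __name__ == "__main__":
    kg = [
        FoodEntity("mapo tofu", embed_image("mapo_tofu.jpg"),
                   ["origin: Sichuan, China", "high in protein from tofu"]),
        FoodEntity("tiramisu", embed_image("tiramisu.jpg"),
                   ["origin: Veneto, Italy", "contains caffeine and sugar"]),
    ]
    print(build_prompt("query_dish.jpg", "What is the cultural origin of this dish?", kg))
```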
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision question answering, multimodal QA, knowledge graphs
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Chinese
Submission Number: 4336