Advancement in Graph Understanding: A Multimodal Benchmark and Fine-Tuning of Vision-Language Models
Abstract: Graph data organizes complex relationships and interactions between objects, facilitating advanced analysis and decision-making across different fields. In this paper, we propose a new paradigm for interactive and instructional graph data understanding and reasoning. Instead of adopting complex graph neural models or heuristic graph-to-text instruction design, we leverage Vision-Language Models (VLMs) to encode graph images with varying structures across different domains. This paper first evaluates the capabilities of public VLMs in graph learning from multiple aspects. It then introduces a novel instruction-following dataset for multimodal graph understanding and reasoning in English and Chinese. Furthermore, by fine-tuning MiniGPT-4 and LLaVA on our dataset, we achieve an accuracy increase of 5%-15% over baseline models, with the best-performing model attaining scores comparable to Gemini in GPT-assisted Evaluation. This research not only showcases the potential of integrating VLMs with graph data but also opens new avenues for advancements in graph data understanding.
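For illustration only: the abstract does not specify the schema of the instruction-following dataset, but a single bilingual record for graph-image understanding might resemble a LLaVA-style conversation entry. The field names, file path, and question/answer text below are hypothetical assumptions, not the paper's actual format.

```python
# Hypothetical sketch of one instruction-following record for multimodal
# graph understanding; all field names and values are illustrative.
record = {
    "image": "graphs/citation_network_017.png",  # rendered graph image (assumed path)
    "language": "en",                            # dataset covers English and Chinese
    "conversations": [
        {"role": "user",
         "content": "How many nodes are directly connected to node A?"},
        {"role": "assistant",
         "content": "Node A has three neighbors: B, C, and D."},
    ],
}

print(record["conversations"][0]["content"])
```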
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English, Chinese