Keywords: Multimodal Language Models, Graph Neural Networks, Large Language Models, Evaluations
TL;DR: We show that current benchmarks are insufficient for evaluating multimodal graph-language reasoning, and introduce a new benchmark to address these limitations.
Abstract: Recent developments in Graph-Language Models (GLMs) aim to combine the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). In this work, we show that current benchmarks for graph-language tasks are inadequate for evaluating multimodal reasoning. We perform a diagnostic study which reveals that high accuracy can be achieved using unimodal information alone, without requiring graph–language integration. To address this, we introduce CLEGR (Compositional Language Graph Reasoning), a benchmark that tests multimodal reasoning via synthetic graph generation paired with questions requiring joint structural and semantic reasoning. Evaluating representative GLM architectures on CLEGR, we find that soft-prompted LLM baselines rival GLMs with full GNN backbones. Moreover, GLMs degrade significantly on tasks that emphasize structural reasoning, highlighting the need for more advanced methods of integrating graph and language inputs. Our findings expose key limitations in current GLMs and establish CLEGR as a step toward rigorous evaluation of explicit graph–language reasoning.
Submission Number: 7