Benchmarking Multi-Hop Reasoning with Controllable Synthetic Datasets

ACL ARR 2026 January Submission2093 Authors

01 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: temporal knowledge graphs, large language models, multi-hop reasoning
Abstract: Large Language Model (LLM)-based methods for Temporal Knowledge Graph (TKG) reasoning tasks have found success by relying on LLMs' impressive pattern recognition abilities. However, the extent of this capability on complex multi-hop patterns remains understudied. To probe the limits of LLM-based methods' multi-hop reasoning abilities, we create a novel synthetic TKG generator and a suite of realistic TKG datasets with varied complexity along several important dimensions. In particular, we study multi-hop patterns complicated by the number of hops, time dispersion, and imbalanced relation and entity distributions. We benchmark LLM- and Graph Neural Network (GNN)-based methods on these synthetic TKGs, finding that LLMs can far outperform GNN-based methods when provided with ideal contexts. However, their performance degrades sharply as contextual noise increases, indicating that retrieval, not multi-hop composition itself, is the primary bottleneck.
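The abstract describes generating synthetic TKGs whose multi-hop patterns vary by hop count, time dispersion, and relation/entity imbalance. The paper's actual generator is not shown here; the following is a minimal, hypothetical sketch of the idea: emit (head, relation, tail, timestamp) quadruples forming multi-hop chains, with hop timestamps spread by a dispersion parameter and relations drawn from a skewed distribution. All names and parameters are illustrative assumptions, not the authors' implementation.

```python
import random

def generate_toy_tkg(num_entities=50, relations=("r0", "r1", "r2"),
                     num_hops=3, num_chains=20, time_span=100,
                     time_dispersion=10, seed=0):
    """Toy temporal KG as (head, relation, tail, time) quadruples.

    Each chain is a multi-hop pattern e0 -> e1 -> ... -> e_k whose hop
    timestamps are offset by up to `time_dispersion`, so answering a query
    about the chain endpoint requires composing `num_hops` facts over time.
    This is a hypothetical sketch, not the paper's generator.
    """
    rng = random.Random(seed)
    # Imbalanced relation distribution: earlier relations are exponentially
    # more frequent, mimicking skew in real-world TKGs.
    weights = [2 ** (len(relations) - i) for i in range(len(relations))]
    quads = []
    for _ in range(num_chains):
        entities = rng.sample(range(num_entities), num_hops + 1)
        start = rng.randrange(time_span)
        for hop in range(num_hops):
            rel = rng.choices(relations, weights=weights)[0]
            t = start + hop * rng.randint(1, time_dispersion)
            quads.append((entities[hop], rel, entities[hop + 1], t))
    return quads

quads = generate_toy_tkg()
print(len(quads))  # num_chains * num_hops = 60 facts
```

Increasing `num_hops` or `time_dispersion` makes the composition task harder along the dimensions the abstract names, while `weights` controls how imbalanced the relation distribution is.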
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation,benchmarking,automatic creation and evaluation of language resources,evaluation,neurosymbolic approaches,prompting
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 2093