Keywords: large language models, graph generation, structural reasoning, graph hallucination, specialized graph tasks, benchmark evaluation
TL;DR: Evaluating 15 LLMs on five graph-generation tasks with three prompting styles, we find that reasoning-enhanced models solve more than twice as many tasks as general-purpose ones, showing that graph-savvy ability stems from architectural design, not just scale
Abstract: While large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, their ability to generate valid graph structures remains underexplored. We evaluate fifteen state-of-the-art LLMs on five specialized graph generation tasks spanning delivery networks, social networks, quantum circuits, gene-disease networks, and transportation systems. We also test each LLM with three prompting strategies: direct, iterative feedback, and program-augmented. Models equipped with explicit reasoning modules (o3-mini-high, o1, Claude 3.7 Sonnet, DeepSeek-R1) solve more than twice as many tasks as their general-purpose peers, independent of parameter count. Error forensics reveals two recurring failure modes: Llama-family models often violate basic structural constraints, whereas Claude models respect topology but mismanage higher-order logical rules. Allowing models to iteratively refine their answers yields uneven gains, underscoring fundamental differences in error-correction capacity. This work demonstrates that graph competence stems from specialized architectural design rather than scale, establishing a framework for developing truly graph-savvy language models. Results and verification scripts are available at https://github.com/anonymized-for-the-blind-review.
Archival Status: Archival
Paper Length: Long Paper (up to 8 pages of content)
Submission Number: 229