Network Dynamics Reasoning: A Novel Benchmark for Evaluating Multi-Step Inference in Large Language Models

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: network dynamics, threshold models, social influence, cascade effects, multi-step reasoning, large language models, benchmark, contamination resistance, temporal dynamics, emergent abilities
TL;DR: We introduce a synthetic benchmark that tests large language models on predicting the outcomes of threshold-based network dynamics, revealing scaling-dependent reasoning abilities.
Abstract: We introduce a novel benchmark for evaluating large language models' ability to reason about network dynamics and multi-step system evolution. The benchmark tests models on predicting the final state of threshold-based adoption processes in social networks, requiring a precise numerical prediction after complex temporal reasoning. We evaluate five state-of-the-art models across different architectures and API providers, revealing significant performance gaps and emergent reasoning capabilities. Our key finding is that Google's Gemini models substantially outperform Meta's Llama and Google's Gemma models: Gemini 1.5 Pro achieves 55% accuracy versus 10% for Llama 3.3 70B, despite the latter's larger parameter count. The benchmark addresses critical gaps in current LLM evaluation by testing contamination-resistant synthetic scenarios, precise numerical reasoning, and multi-step temporal dynamics: capabilities essential for AI systems operating in complex real-world environments.
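
To make the task concrete, below is a minimal Python sketch of the kind of threshold-based adoption cascade the benchmark asks models to predict. It assumes the standard linear-threshold rule (a node adopts once the fraction of its adopting neighbors meets its threshold) with synchronous update rounds; the abstract does not specify the exact generator or update rule, so the graph, thresholds, and seed set here are illustrative only, not the benchmark's actual instances.

```python
def simulate_cascade(neighbors, thresholds, seeds):
    """Run synchronous threshold updates until no new node adopts.

    neighbors  -- dict: node -> list of neighboring nodes
    thresholds -- dict: node -> adoption threshold in [0, 1]
    seeds      -- initially adopting nodes
    Returns the final set of adopters.
    """
    adopted = set(seeds)
    while True:
        # A non-adopter adopts when the adopting fraction of its
        # neighbors reaches its threshold (linear-threshold assumption).
        new = {
            node
            for node, nbrs in neighbors.items()
            if node not in adopted
            and nbrs
            and sum(n in adopted for n in nbrs) / len(nbrs) >= thresholds[node]
        }
        if not new:  # fixed point reached
            return adopted
        adopted |= new

# Toy 5-node network (hypothetical): the seeds trigger C, C triggers D,
# and D triggers E, so a correct answer requires tracing three rounds
# of propagation rather than a single-step lookup.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
thresholds = {"A": 0.0, "B": 0.0, "C": 0.5, "D": 0.4, "E": 0.9}
final = simulate_cascade(graph, thresholds, seeds={"A", "B"})
print(f"Final adopters: {sorted(final)} (count = {len(final)})")
```

In this toy instance the cascade reaches all five nodes, so the precise numerical answer a model would presumably be scored on is a final adopter count of 5; getting it right requires correctly simulating each round of the dynamics.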
Submission Number: 196