MobiSim-Bench: A Multi-Perspective Benchmark for Evaluating LLM-Agent-Based Human Mobility Simulation

ICLR 2026 Conference Submission 24001 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Agent Simulation, Mobility Simulation, LLM Agent Benchmark
Abstract: With advances in large language models (LLMs) and agent technology, LLM agents are transforming social science research on human behavior simulation through their powerful role-playing capabilities. Among studies simulating complex human behaviors, mobility behavior simulation has received widespread attention and has important implications for real-world applications. Unlike data-driven statistical learning approaches, LLM-agent-based simulation methods have the potential to support all-day simulation and generation of human mobility behavior, and even to simulate adaptive behavioral responses to environmental changes in extraordinary scenarios. To evaluate LLM agents for human mobility behavior simulation holistically and from multiple perspectives, we first propose an evaluation framework comprising three perspectives: **Robustness**, **Realism**, and **Responsiveness**. To implement this framework, we construct and publish a multi-perspective benchmark named **MobiSim-Bench**, built on the AgentSociety simulation framework. The benchmark contains the **Daily Mobility Simulation**, mainly for evaluating realism, and the **Hurricane Mobility Simulation**, mainly for evaluating responsiveness. Based on this benchmark, we organized a challenge with 18 teams, collecting and evaluating LLM agents designed by different researchers; in total, 967 agents were deployed. The design approach that uses the LLM as the brain achieves the best realism, while the approach that uses the LLM as an extra component is better suited to the responsiveness scenario. These results show that our evaluation framework and benchmark examine LLM agents' performance in simulating human behavior from distinct perspectives, while also revealing shortcomings of existing agent designs, which should drive the research community to further explore designs that satisfy robustness, realism, and responsiveness simultaneously. The benchmark code is available at https://anonymous.4open.science/r/MobiSim-Bench-1077/.
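For intuition, the sketch below shows one plausible way a realism score of the kind the abstract describes could be computed: comparing an agent's simulated visit-frequency histogram against a real-world reference. The function `realism_score`, the histogram representation, and the choice of Jensen-Shannon distance are illustrative assumptions, not the benchmark's actual scoring code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def realism_score(sim_counts: np.ndarray, real_counts: np.ndarray) -> float:
    """Illustrative realism score: 1 minus the Jensen-Shannon distance (base 2)
    between simulated and real visit-frequency histograms; higher is better."""
    sim_p = sim_counts / sim_counts.sum()    # normalize to a probability distribution
    real_p = real_counts / real_counts.sum()
    return 1.0 - float(jensenshannon(sim_p, real_p, base=2))

# Toy example: visit counts over five location categories
# (home, work, food, shopping, leisure).
sim = np.array([40.0, 30.0, 12.0, 10.0, 8.0])
real = np.array([42.0, 28.0, 14.0, 9.0, 7.0])
print(f"realism = {realism_score(sim, real):.3f}")  # close to 1.0 here
```

With base 2, the Jensen-Shannon distance lies in [0, 1], so the score is directly interpretable; any distributional similarity measure over mobility statistics (trip distances, dwell times, location visits) could be slotted in the same way.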
Primary Area: datasets and benchmarks
Submission Number: 24001