Evaluating point-light biological motion in multimodal large language models

19 Sept 2025 (modified: 12 Feb 2026), ICLR 2026 Conference Desk Rejected Submission, CC BY 4.0
Keywords: biological motion, multimodal large language models, point-light displays, action processing, foundation models
TL;DR: A preliminary benchmark evaluating point-light action understanding (biological motion) in multimodal large language models
Abstract: Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. Multimodal large language models (MLLMs), despite their progress on various multimodal tasks, currently lack the structural and semantic abstraction required to interpret human motion. Because PLDs isolate body motion as the sole source of meaning, they are a key stimulus for testing the limits of action understanding in these systems. Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from PLDs. We test state-of-the-art proprietary and open-source models on single-actor and socially interacting PLDs. Our results reveal consistently low performance across models, pointing to fundamental gaps in action processing and spatiotemporal understanding.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21246
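
To make the stimulus format described in the abstract concrete, the following is a minimal sketch of how a single point-light frame could be rasterized from 2D joint positions. It is an illustration only, not ActPLD's pipeline: the function name `render_pld_frame`, the 13-joint layout, the coordinates, and the 32x32 grid are all hypothetical assumptions.

```python
# Hypothetical sketch: rasterize one point-light frame from 2D joint positions.
# Joint names, coordinates, and grid size are illustrative assumptions,
# not values taken from ActPLD.

def render_pld_frame(joints, width=32, height=32):
    """Return a height x width grid with 1 at each joint location, 0 elsewhere.

    `joints` maps a joint name to (x, y) in normalized [0, 1] coordinates,
    with y increasing downward.
    """
    frame = [[0] * width for _ in range(height)]
    for x, y in joints.values():
        # Clamp so that a coordinate of exactly 1.0 stays inside the grid.
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        frame[row][col] = 1
    return frame


# An illustrative standing pose; only the dots are kept, no limbs or outline.
EXAMPLE_JOINTS = {
    "head": (0.50, 0.10),
    "l_shoulder": (0.40, 0.25), "r_shoulder": (0.60, 0.25),
    "l_elbow": (0.35, 0.40), "r_elbow": (0.65, 0.40),
    "l_wrist": (0.32, 0.55), "r_wrist": (0.68, 0.55),
    "l_hip": (0.44, 0.55), "r_hip": (0.56, 0.55),
    "l_knee": (0.43, 0.75), "r_knee": (0.57, 0.75),
    "l_ankle": (0.42, 0.95), "r_ankle": (0.58, 0.95),
}

frame = render_pld_frame(EXAMPLE_JOINTS)
lit = sum(sum(row) for row in frame)  # number of lit dots in the frame
```

A single frame like this carries very little information about the action; in a PLD the meaning comes from how the dots move across successive frames, which is why such stimuli probe spatiotemporal understanding rather than static appearance.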