Keywords: biological motion, multimodal large language models, point-light displays, action processing, foundation models
TL;DR: A preliminary benchmark evaluating point-light action understanding (biological motion) in multimodal large language models
Abstract: Humans can extract rich semantic information from minimal visual cues, as demonstrated by point-light displays (PLDs), which consist of sparse sets of dots localized to key joints of the human body. Multimodal large language models (MLLMs), despite their progress on a range of multimodal tasks, currently lack the structural and semantic abstraction required to interpret human motion. Because PLDs isolate body motion as the sole source of meaning, they are a key stimulus for testing the limits of action understanding in these systems. Here we introduce ActPLD, the first benchmark to evaluate action processing in MLLMs from PLDs. We test state-of-the-art proprietary and open-source models on single-actor and socially interacting PLDs. Our results show consistently low performance across models, exposing fundamental gaps in action processing and spatiotemporal understanding.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21246