Keywords: Implicit Motion Blindness, MLLMs, Accessibility, Human-Centered AI, Position Paper
TL;DR: This position paper highlights "Implicit Motion Blindness"—MLLMs' inability to detect subtle motion—as a key flaw in video understanding, undermining user trust. We call for a shift from semantic recognition to physical perception.
Abstract: Multimodal Large Language Models (MLLMs) hold immense promise as assistive technologies for the blind and visually impaired (BVI) community.
However, we identify a critical failure mode that undermines their trustworthiness in real-world applications.
We introduce the ***Escalator Problem***, the inability of state-of-the-art models to perceive an escalator's direction of travel, as a canonical example of a deeper limitation we term ***Implicit Motion Blindness***.
This blindness stems from the dominant frame-sampling paradigm in video understanding: by treating a video as a discrete sequence of static images, models fundamentally struggle to perceive continuous, low-signal motion.
This is a position paper: rather than proposing a new model, we (I) formally articulate this blind spot, (II) analyze its implications for user trust, and (III) issue a call to action.
We advocate for a paradigm shift from purely semantic recognition towards robust physical perception and urge the development of new, human-centered benchmarks that prioritize safety, reliability, and the genuine needs of users in dynamic environments.
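To make the frame-sampling limitation concrete, below is a minimal sketch of the uniform-sampling front-end most MLLMs use for video. It assumes OpenCV and NumPy are available; the file name `escalator.mp4` and the choice of 8 frames are hypothetical illustrations, not details from the paper. The point it demonstrates: when motion is slow and periodic (escalator steps), sparsely sampled frames can look nearly identical, so the motion signal is aliased away before the model ever sees it.

```python
# Sketch of the uniform frame-sampling pipeline common in MLLM video
# front-ends, illustrating why slow, continuous motion (e.g. an
# escalator) can be nearly invisible to it. Assumes OpenCV and NumPy;
# "escalator.mp4" is a hypothetical input path.
import cv2
import numpy as np


def sample_frames(path: str, num_frames: int = 8) -> list[np.ndarray]:
    """Uniformly sample `num_frames` grayscale frames from a video."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames


def mean_frame_difference(frames: list[np.ndarray]) -> float:
    """Mean absolute pixel change between consecutive sampled frames.

    For slow, periodic motion such as escalator steps, sparse sampling
    can alias the signal so this value stays near zero; the motion is
    effectively invisible to a model that only sees these frames.
    """
    diffs = [np.abs(a.astype(float) - b.astype(float)).mean()
             for a, b in zip(frames, frames[1:])]
    return float(np.mean(diffs)) if diffs else 0.0


if __name__ == "__main__":
    frames = sample_frames("escalator.mp4", num_frames=8)
    print(f"sampled {len(frames)} frames, "
          f"mean inter-frame change: {mean_frame_difference(frames):.2f}")
```

A near-zero inter-frame change on a video that a sighted viewer immediately perceives as moving is exactly the failure mode the abstract describes: the representation discards the continuous motion before any semantic reasoning begins.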
Submission Number: 2