Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users

ACL ARR 2025 February Submission 4312 Authors

15 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0
Abstract:

This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.

Paper Type: Long
Research Area: Human-Centered NLP
Research Area Keywords: human-AI interaction; human factors in NLP; human-centered evaluation
Contribution Types: Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: Arabic, Bengali, Czech, Danish, German, Greek, English, Spanish, Persian, Finnish, Filipino, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Korean, Māori, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Swedish, Swahili, Telugu, Thai, Turkish, Ukrainian, Vietnamese, Chinese
Submission Number: 4312