Automated Context-Aware Navigation Support for Individuals with Visual Impairment Using Multimodal Language Models in Urban Environments
Keywords: accessibility, navigation support, vision language model, VLM, multimodal language model
Abstract: Vision transformer capabilities for image understanding have advanced significantly in recent years. Multimodal vision transformers can now generate accurate captions for images and demonstrate strong image-understanding abilities. More recently, these models have been extended to handle video, with or without audio. However, such transformers have seldom been trained on accessibility-related datasets. In this study, we focus on generating navigation instructions for individuals with visual impairment in outdoor, urban environments. We investigate the use of VLMs as the backbone of a pipeline that combines prompting, postprocessing, and other techniques to produce spatially and temporally accurate instructions. Specifically, we use the spatial-temporal vision language model VideoLLaMA3 to process videos and generate a series of instructions from a prompt designed specifically for individuals with visual impairments. Our approach surpasses the performance of GPT-4o. In the future, we anticipate extending this approach with landmark detection and improved fine-tuning.
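
To make the described stage concrete, the sketch below shows how a video-to-instructions step could look when VideoLLaMA3 is loaded from its public Hugging Face checkpoint (DAMO-NLP-SG/VideoLLaMA3-7B, via trust_remote_code). This is a minimal illustration under stated assumptions, not the authors' released pipeline: the checkpoint name, the accessibility-oriented prompt text, the frame-sampling parameters, and the line-based postprocessing are assumptions made for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed public checkpoint; the paper's (possibly fine-tuned) weights may differ.
MODEL_ID = "DAMO-NLP-SG/VideoLLaMA3-7B"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Illustrative accessibility-oriented prompt (not the paper's exact wording).
PROMPT = (
    "You are a sighted guide for a pedestrian who is blind. Watch the street-level "
    "video and give short, numbered walking instructions in temporal order. Mention "
    "obstacles, curbs, and crossings, with the direction and approximate distance of each."
)

def generate_instructions(video_path: str, max_new_tokens: int = 512) -> list[str]:
    """Run one video through the VLM and split the reply into instruction steps."""
    conversation = [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                # fps / max_frames control temporal sampling of the clip
                # (keys assumed from the checkpoint's conversational interface).
                {"type": "video", "video": {"video_path": video_path, "fps": 1, "max_frames": 128}},
                {"type": "text", "text": PROMPT},
            ],
        },
    ]
    inputs = processor(conversation=conversation, return_tensors="pt")
    inputs = {k: v.to(model.device) if isinstance(v, torch.Tensor) else v
              for k, v in inputs.items()}
    if "pixel_values" in inputs:
        inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    reply = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    # Simple postprocessing: keep non-empty lines as individual instruction steps.
    return [line.strip() for line in reply.splitlines() if line.strip()]

if __name__ == "__main__":
    for step in generate_instructions("example_street_walk.mp4"):
        print(step)
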
Submission Number: 12