Guiding Multimodal Large Language Models with Blind and Low Vision Visual Questions for Proactive Visual Interpretations
Keywords: Blind, Low Vision, Visual Interpretation, RAG, Context Retrieval, Visual Interpretation Systems, Captioning, MLLM, Multimodal Large Language Models
TL;DR: We present a context-retrieval framework that leverages historical visual questions from BLV users to steer multimodal large language models toward relevant image descriptions, achieving higher accuracy than baseline captions.
Abstract: Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their high accuracy and ability to provide rich, human-like interpretations. However, these applications often default to comprehensive, lengthy descriptions regardless of context. This leads to inefficient exchanges, as users must sift through irrelevant details rather than receiving the specific information they are likely to seek. To deliver more contextually relevant information, we developed a system that draws on historical BLV user questions. Given an image, our system identifies similar past visual contexts from the VizWiz-LF dataset and uses the associated questions to guide the MLLM to generate descriptions more relevant to BLV users. An evaluation with three human labelers who reviewed 92 context-aware and context-free descriptions showed that context-aware descriptions anticipated and answered users' questions in 76.1\% of cases (70 out of 92) and were preferred in 54.4\% of comparisons (50 out of 92).
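The sketch below illustrates the retrieval-and-prompting flow the abstract describes: find visually similar past images, collect the BLV questions attached to them, and fold those questions into the description prompt. It is a minimal illustration only; the embedding model, top-k value, prompt wording, and toy data are assumptions, not the paper's exact implementation.

```python
# Illustrative sketch only: the embedding model, top-k, and prompt wording
# are assumptions, not the authors' exact pipeline.
import numpy as np

def retrieve_similar_questions(query_emb: np.ndarray,
                               corpus_embs: np.ndarray,
                               corpus_questions: list[str],
                               k: int = 3) -> list[str]:
    """Return the questions attached to the k most similar past images,
    using cosine similarity over precomputed image embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    top_idx = np.argsort(-sims)[:k]
    return [corpus_questions[i] for i in top_idx]

def build_prompt(retrieved_questions: list[str]) -> str:
    """Fold retrieved BLV questions into a description prompt for the MLLM."""
    question_block = "\n".join(f"- {q}" for q in retrieved_questions)
    return (
        "Describe this image for a blind or low vision user. "
        "Prioritize details that would answer questions like these, "
        "which BLV users asked about visually similar images:\n"
        f"{question_block}"
    )

# Example usage with toy embeddings; a real system would embed the query
# image and the VizWiz-LF images with the same vision encoder.
rng = np.random.default_rng(0)
corpus_embs = rng.normal(size=(5, 512))
corpus_questions = [
    "What does this label say?",
    "What color is this shirt?",
    "Is this food expired?",
    "What denomination is this bill?",
    "What is on this screen?",
]
query_emb = rng.normal(size=512)
prompt = build_prompt(retrieve_similar_questions(query_emb, corpus_embs, corpus_questions))
print(prompt)
```

In the system described above, the resulting prompt would be sent alongside the query image to the MLLM, which then generates the context-aware description.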
Supplementary Material: zip
Submission Number: 18