Keywords: Visual Question Answering, Long-form Question Answering, Vision Language Models, Accessibility
TL;DR: We study how human experts and VLMs generate long-form answers to visual questions from blind and low vision people, and how to evaluate these answers.
Abstract: Vision language models can now generate long-form answers to questions about images – long-form visual question answering (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We define and annotate the functional roles of sentences in LFVQA and demonstrate that long-form answers contain information beyond the answer to the question, such as explanations and suggestions. We further conduct automatic and human evaluations with BLV and sighted people to assess long-form answers. BLV people perceive both human-written and generated long-form answers to be plausible, but generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images).
Submission Number: 9