Long-Form Answers to Visual Questions from Blind and Low Vision People

Published: 10 Jul 2024, Last Modified: 26 Aug 2024, Venue: COLM, License: CC BY 4.0
Research Area: Data, Evaluation, Societal implications, LMs on diverse modalities and novel applications
Keywords: Visual Question Answering, Long-form Question Answering, Vision Language Models, Accessibility
TL;DR: We explore how human experts and VLMs generate long-form answers to visual questions from blind and low vision people and evaluate them.
Abstract: Vision language models can now generate long-form answers to questions about images – long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate functional roles of sentences of LFVQA and demonstrate that long-form answers contain information beyond the question answer such as explanations and suggestions to retake photos. We further conduct automatic and human evaluations involving BLV and sighted people to evaluate long-form answers. While BLV people perceive both human-written and generated long-form answers as plausible, generated answers often hallucinate incorrect visual details, especially for unanswerable visual questions (e.g., blurry or irrelevant images). To reduce hallucinations, we evaluate VQA models on their ability to abstain from answering unanswerable questions.
Supplementary Material: zip
Submission Number: 522