VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

ACL ARR 2024 June Submission3619 Authors

16 Jun 2024 (modified: 08 Aug 2024) | ACL ARR 2024 June Submission | CC BY 4.0
Abstract: This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large vision-language models (VLMs) focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions in a visually depicted situation. VIVA contains 1,062 images depicting diverse real-world situations, together with manually annotated decisions grounded in them. Given an image from the benchmark, a model should select the most appropriate action to address the situation and provide the relevant human values and the reason underlying the decision. Extensive experiments on VIVA reveal the limitations of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, cross-modal application, human-subject application-grounded evaluations, values and culture, human-centered evaluation
Contribution Types: Data resources
Languages Studied: English
Submission Number: 3619