OWLViz: An Open-World Benchmark for Visual Question Answering

ACL ARR 2025 July Submission744 Authors

28 Jul 2025 (modified: 24 Aug 2025) | ACL ARR 2025 July Submission | Everyone | CC BY 4.0
Abstract: We present OWLViz, a challenging benchmark for Open WorLd VISual question answering. OWLViz presents short queries that require integrating multiple capabilities, including common-sense knowledge, visual understanding, web exploration, and specialized tool usage. While humans achieve 69.2% accuracy on these intuitive tasks, even state-of-the-art VLMs struggle, with the best model, Gemini 2.0, achieving only 26.6% accuracy. Current agentic VLMs, which rely on limited vision and vision-language models as tools, perform even worse. This performance gap reveals significant limitations in multimodal systems' ability to select appropriate tools and execute complex reasoning sequences, establishing new directions for advancing practical AI research.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Open World, Tool Use, VQA
Contribution Types: Data resources
Languages Studied: English
Previous URL: https://openreview.net/forum?id=oxgRh77rxv
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: This paper offers a small dataset for evaluating systems. We did not release the golden data, to prevent data contamination. We do not foresee critical potential risks.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sections 1, 2, 3.5, 4
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Ethical consideration
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Ethical consideration
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The dataset does not include personal information.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3
B6 Statistics For Data: Yes
B6 Elaboration: Section 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Appendix B (Experiment Details)
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Appendix B (Experiment Details)
C3 Descriptive Statistics: No
C3 Elaboration: The results were computed based on a single run
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: No
D1 Elaboration: We do not use crowd workers.
D2 Recruitment And Payment: N/A
D2 Elaboration: We do not use crowd workers.
D3 Data Consent: Yes
D3 Elaboration: We discuss this for evaluation; no golden data is released.
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: No
D5 Elaboration: The authors are annotators.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We wrote it ourselves.
Author Submission Checklist: yes
Submission Number: 744