Harnessing LLMs for VQA: A Prompted Benchmark with Animate/Inanimate Keywords

Published: 2024, Last Modified: 14 Nov 2025, ICTC 2024, CC BY-SA 4.0
Abstract: In the field of NLP, Large Language Models (LLMs) have recently achieved significant advancements, leading to the development of various benchmarks for their evaluation. Alongside NLP, Vision Language Models (VLMs) have also progressed significantly, similar to LLMs. However, benchmarks for VLMs are still relatively underdeveloped compared to those for NLP, and their construction is often costly. In this work, we propose an automatically generated, LLM-based benchmark for evaluating VLMs and conduct a visual question answering task to assess this benchmark. The benchmark consists of multiple-choice questions that distinguish between animate and inanimate objects; these distinctions, along with entity and object information within images, are generated automatically. We evaluate the performance of an open VLM using the generated multiple-choice questions, demonstrating the model's capabilities and the significance of the automatically generated benchmark. Finally, we discuss the necessity and future directions for benchmark research in this area.