Keywords: Dataset, Synthetic Data, Visual Knowledge, VQA, Multimodal LLM, RAG, Knowledge Update
TL;DR: We introduce a visual knowledge dataset for evaluating how well current models seek and update the latest visual knowledge.
Abstract: The visual world around us constantly evolves, from real-time news and social media trends to global infrastructure changes visible through satellite imagery and augmented reality enhancements. However, Multimodal Large Language Models (MLLMs), which automate many tasks, struggle to stay current, limited by the cutoff dates in their fixed training datasets.
To quantify this stagnation, we introduce LiveVQA, a first-of-its-kind dataset of 107,143 samples across 12 categories, specifically designed to support research on both seeking and updating live visual knowledge.
Drawing from recent news articles, video platforms, and academic publications published between April 2024 and May 2025, LiveVQA enables evaluation of how models handle the latest visual information beyond their knowledge boundaries and how well current methods update them.
Our comprehensive benchmarking of 17 state-of-the-art MLLMs reveals significant performance gaps on content beyond their knowledge cutoff, while tool-use and agentic visual-seeking frameworks yield an average improvement of 327%.
Furthermore, we explore parameter-efficient fine-tuning methods for updating MLLMs with new visual knowledge.
We examine in depth the critical balance between adapter capacity and model capability when updating MLLMs in this way.
All experimental datasets and source code are publicly available at https://livevqa.github.io.
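As a rough illustration of the parameter-efficient updating setting described above, the following is a minimal sketch (not the authors' code) of attaching LoRA adapters to a multimodal LLM with the Hugging Face PEFT library. The base model name, target modules, and LoRA hyperparameters are illustrative assumptions; the adapter rank is the "capacity" knob whose trade-off against base-model capability the abstract refers to.

```python
# Minimal sketch: LoRA-based parameter-efficient fine-tuning of an MLLM.
# All model names and hyperparameters below are assumptions for illustration.
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Hypothetical choice of base multimodal model (not specified by the paper).
base = AutoModelForVision2Seq.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                                  # adapter rank: the "adapter capacity" knob
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # only the adapter weights are trainable
```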
Croissant File: json
Dataset URL: https://huggingface.co/datasets/ONE-Lab/LiveVQA-new
Code URL: https://github.com/fumingyang2004/LIVEVQA
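The dataset above can presumably be loaded with the Hugging Face `datasets` library; the sketch below assumes a standard dataset layout, and the split name and the printed fields are assumptions rather than the documented schema (check the dataset card).

```python
# Minimal sketch, assuming LiveVQA is hosted as a standard Hugging Face dataset.
from datasets import load_dataset

# "train" split is an assumption; see the dataset card for actual splits/fields.
ds = load_dataset("ONE-Lab/LiveVQA-new", split="train")
print(ds[0])  # inspect one VQA sample
```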
Supplementary Material: zip
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 1549