Worldwide LiveVQA: Real-Time Visual Knowledge Seeking and Updating Across Languages

ACL ARR 2026 January Submission 3958 Authors

04 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Benchmark & Dataset, Synthetic Data, Visual Knowledge, VQA, Multimodal LLM, News, Factuality QA
Abstract: Knowledge about the visual world is not only constantly evolving but also inherently global: breaking news in Tokyo, political events in São Paulo, and cultural phenomena in Cairo are first reported in Japanese, Portuguese, and Arabic, carrying regional context that English-centric resources cannot fully capture. Yet existing resources for visual knowledge remain confined to English, creating a "Worldwide Knowledge Gap" that hinders the development of truly global assistants. To quantify this gap, we introduce LiveVQA-W(orldwide), the first dynamically updated dataset for real-time, multilingual visual knowledge seeking and updating across ten major languages. Drawing from worldwide news outlets, YouTube videos, and academic platforms during August–December 2025, LiveVQA-W comprises 234K images, 872K questions, and 171K visual entities with a hierarchical evaluation scheme: Level 1 for visual entity recognition and Level 2 for multi-hop cross-lingual reasoning. Our comprehensive benchmarking of 15 state-of-the-art MLLMs reveals that models without search achieve near-random performance, while search-augmented models exhibit severe linguistic bias, with English accuracy nearly double that of other languages. Furthermore, we explore visual knowledge updating through large-scale training, finding that injected knowledge improves recall but remains fragile under prompt rephrasing and image perturbations such as rotation and flipping. We release the fully reproducible data collection pipeline and dataset to support continuous community-driven expansion.
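A minimal Python sketch of the robustness probe mentioned at the end of the abstract (checking whether injected visual knowledge survives rotation and flipping), using Pillow; the ask_model callable and exact-match scoring are illustrative assumptions, not the paper's released evaluation code:

from PIL import Image

def perturbed_variants(path):
    # Build the simple perturbations named in the abstract:
    # rotations and a horizontal flip.
    img = Image.open(path).convert("RGB")
    return {
        "original": img,
        "rotate_90": img.rotate(90, expand=True),
        "rotate_180": img.rotate(180, expand=True),
        "flip_horizontal": img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),
    }

def consistency_rate(ask_model, image_path, question, reference_answer):
    # ask_model(image, question) -> str is a hypothetical wrapper around
    # the MLLM under test. A model whose injected knowledge is robust
    # should answer identically on every variant.
    variants = perturbed_variants(image_path)
    hits = sum(
        ask_model(img, question).strip().lower() == reference_answer.strip().lower()
        for img in variants.values()
    )
    return hits / len(variants)

A consistency rate well below 1.0 on perturbed variants, paired with a correct answer on the original image, is the fragility pattern the abstract reports.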
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling, Multilingualism and Cross-Lingual NLP, Multimodality and Language Grounding to Vision, Robotics and Beyond, Question Answering, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: English, Chinese, Spanish, Arabic, Portuguese, Indonesian/Malay, French, Japanese, Russian, German
Submission Number: 3958