Keywords: Everyday knowledge, Multimodal Question Answering, Culturally Situated QA, Underrepresented Languages
TL;DR: EverydayMMQA: ~1M images with 14.8M QAs (+3.7M spoken) for culturally grounded English/Arabic VQA, supporting speech/text × image with text‑only answers for SFT and benchmarking.
Abstract: Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce **Everyday Multimodal and Multilingual QA (EverydayMMQA)**, a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed **OASIS**, a multimodal dataset integrating speech, images, and text. With approximately 0.92M images and 14.8M QA pairs, of which 3.7M are spoken questions, OASIS enables four unique input combinations: speech-only, text-only, speech+image, and text+image. The dataset focuses on English and Arabic varieties across 18 countries, with content curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition, requiring pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide both a benchmark and a training dataset for building multimodal LLMs capable of handling a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 23278