Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Published: 23 Sept 2025, Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: Dataset Benchmark, Multimodality, Open-domain
Abstract: Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce \textbf{Visual-TableQA}, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is \textbf{modular, scalable, and fully autonomous}, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. \textbf{Visual-TableQA} comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under \$100. To promote diversity and creativity, our pipeline performs \textbf{multi-model collaborative data generation} via \textbf{cross-model prompting (‘inspiration’)} and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on \textbf{Visual-TableQA} generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset’s synthetic nature. The full pipeline and resources are publicly available in our \href{https://github.com/AI-4-Everyone/Visual-TableQA}{GitHub repository}.
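The abstract outlines a seed-generate-filter loop: a stronger model proposes topics and table layouts ("inspiration"), another model elaborates them into LaTeX tables and QA pairs, and an LLM jury decides what to keep. The sketch below is a minimal illustration of that loop under stated assumptions; the model names, prompts, and the `query` helper are illustrative choices, not the paper's exact configuration, which is available in the linked GitHub repository.

```python
# Hypothetical sketch of the seed -> generate -> jury-filter loop described in the
# abstract. Model names and prompts are assumptions; adapt query() to your own client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query(model: str, prompt: str) -> str:
    """Single-turn call to a chat model; returns the text of the first choice."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

SEEDER = "gpt-4o"                  # stronger model: seeds topics/layouts ("inspiration")
GENERATOR = "gpt-4o-mini"          # weaker model: elaborates tables and QA pairs
JURY = ["gpt-4o", "gpt-4o-mini"]   # LLM jury for validation/filtering

def generate_sample() -> dict | None:
    # 1) Inspiration: a stronger model proposes a topic and an unusual table layout.
    seed = query(SEEDER, "Propose a topic and a rich table layout for a LaTeX-rendered "
                         "table (multi-level headers, merged cells, etc.).")
    # 2) Generation: another model elaborates the seed into a table and QA pairs.
    table = query(GENERATOR, f"Write a complete LaTeX table realizing this seed:\n{seed}")
    qa = query(GENERATOR, f"Write 3 reasoning-intensive QA pairs grounded in this table:\n{table}")
    # 3) Validation: an LLM jury votes; the sample is kept only on majority approval.
    votes = [
        query(m, f"Table:\n{table}\nQA:\n{qa}\nAnswer YES if the QA pairs are answerable "
                 "from the table and require genuine reasoning, otherwise NO.")
        .strip().upper().startswith("YES")
        for m in JURY
    ]
    return {"table": table, "qa": qa} if sum(votes) > len(JURY) / 2 else None
```

In this reading of the pipeline, diversity comes from the seeding step (different models and prompts produce different layouts and topics) while quality control comes from the jury step, which filters out unanswerable or trivial QA pairs before the tables are rendered.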
Submission Number: 213