Abstract: Tables serve as a core format for representing structured data on the web, as their two-dimensional layouts effectively encode complex inter-entity relationships. However, real-world web tables often feature heterogeneous structures and rich semantics. Accurately interpreting such tables requires not only spatial layout perception but also multi-step reasoning across rows and columns, posing substantial challenges to web intelligence systems. Multimodal large language models (MLLMs) show promise in table question answering (TableQA) by leveraging visual layouts. Their performance on complex web tables nevertheless remains uneven, and existing benchmarks often blur the impact of individual difficulty factors, which hinders precise capability analysis. To move TableQA evaluation beyond superficial task difficulty and toward interpretable capability modeling, we introduce MMTableBench, a multi-level benchmark that systematically evaluates MLLMs along two fine-grained dimensions: layout complexity and reasoning complexity. By organizing table-question pairs along these axes, MMTableBench facilitates a detailed evaluation of model performance under varying structural and reasoning challenges, while revealing the respective strengths and limitations of multimodal inputs. Our comprehensive analysis shows that state-of-the-art MLLMs still exhibit notable limitations when confronted with complex layouts and deep reasoning tasks, underscoring persistent gaps despite the structural advantages offered by visual inputs. MMTableBench thus provides not only a rigorous evaluation framework but also a diagnostic tool for analyzing and interpreting model behaviors, enabling more transparent and explainable progress in multimodal TableQA development.