Keywords: large language model, multi-modal, table classification
Abstract: Reasoning over structured tabular data remains a persistent challenge for large language models (LLMs), primarily due to the textual bottleneck: standard approaches serialize tables into raw text, fragmenting cell-level semantics into incoherent token sequences, while specialized models like TableGPT2 often sacrifice fine-grained detail through aggressive column-wise aggregation. We propose TLM (Table-Language Model), a multimodal framework that treats each table cell as an atomic semantic unit and preserves structural integrity natively. TLM integrates a lightweight, structure-aware table encoder with a frozen LLM backbone, explicitly modeling row–column dependencies and enabling the model to perceive tables as coherent two-dimensional grids rather than disjointed strings. Crucially, by aligning compact table representations with the LLM’s embedding space, our approach leverages pre-trained linguistic priors as a stable reasoning engine—without exposing it to raw numerical noise or excessive sequence length. Evaluated on zero-shot table classification benchmarks, TLM achieves a mean accuracy of 0.7503, outperforming TableGPT2 by 20.39% and even surpassing XGBoost by 14.75% on complex relational reasoning tasks. The model attains this performance with dramatically shorter inputs, offering not only higher accuracy but also improved reliability and interpretability through preserved cell-level granularity. Code will be released upon publication.
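The encoder described in the abstract (each cell as an atomic semantic unit, row–column position information added, then projection into the LLM's embedding space) can be illustrated with a minimal sketch. All names, dimensions, and the toy character-count cell embedder below are hypothetical stand-ins, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
D_CELL, D_LLM = 32, 64  # hypothetical encoder and LLM embedding dims

def embed_cell(text, d=D_CELL):
    # Toy stand-in for a learned cell embedder: character-count features.
    v = np.zeros(d)
    for ch in text:
        v[ord(ch) % d] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def encode_table(cells, W_proj, row_emb, col_emb):
    """cells: list of rows of strings.
    Returns one compact token per cell, shape (n_rows * n_cols, D_LLM)."""
    toks = []
    for i, row in enumerate(cells):
        for j, cell in enumerate(row):
            # Structure-aware: add row/column embeddings so the model sees
            # the table as a 2-D grid rather than a flat token stream.
            h = embed_cell(cell) + row_emb[i] + col_emb[j]
            # Project into the (frozen) LLM's embedding space.
            toks.append(h @ W_proj)
    return np.stack(toks)

table = [["age", "income"], ["34", "52k"], ["29", "48k"]]
W = rng.normal(size=(D_CELL, D_LLM)) * 0.1
rows = rng.normal(size=(len(table), D_CELL)) * 0.1
cols = rng.normal(size=(len(table[0]), D_CELL)) * 0.1
tokens = encode_table(table, W, rows, cols)
print(tokens.shape)  # one D_LLM-dim token per cell: far shorter than text serialization
```

Note how a 3×2 table yields only 6 input tokens for the LLM, which is the "dramatically shorter inputs" property the abstract claims relative to serializing every cell into raw text.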
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal application, cross-modal information extraction, multimodality
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1486