Abstract: Extracting structured tables from chart images is a challenging task that underpins numerous downstream document analysis applications. While previous studies have demonstrated that multimodal large language models (MLLMs) and vision-language models (VLMs) can convert charts into tables, these models frequently fail to adhere to strict formatting standards, omit fine-grained labels, or introduce numerical inaccuracies. In this work, we introduce ChartAgent, a plug-and-play, agent-based framework that augments any off-the-shelf VLM through a two-stage agentic pipeline. In the first stage, a chart-to-table pretrained VLM generates an initial table directly from the chart image. In the second stage, a ReAct-style LLM agent iteratively corrects the generated table by cross-checking table entries against the corresponding visual regions. This agent can optionally invoke a novel zooming tool designed for detailed and precise inspection of complex, densely packed chart areas. To evaluate the effectiveness of ChartAgent, we benchmark its performance on the ChartQA dataset against state-of-the-art methods. Our experiments demonstrate consistent improvements over both VLM-only and single-pass correction baselines across structural and numerical metrics. The modular design of ChartAgent enables seamless integration with any VLM without requiring additional fine-tuning. This approach significantly enhances header alignment, numerical fidelity, and overall table quality, providing a robust and efficient solution for accurate chart-to-table extraction.
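As a concrete illustration of the two-stage pipeline summarized above, the sketch below shows one possible shape of such a system in Python. It is hypothetical and not the authors' implementation: the chart-to-table VLM and the ReAct-style agent are abstracted behind callables, and all names (`chart_agent`, `vlm_generate_table`, `agent_step`, `zoom`) are illustrative assumptions rather than the actual ChartAgent API.

```python
# Minimal sketch (hypothetical, not the authors' code) of a two-stage
# chart-to-table pipeline: (1) a chart-to-table VLM drafts an initial table,
# (2) a ReAct-style agent loop cross-checks the table against the chart,
# optionally zooming into dense regions before finishing.
from typing import Callable, Tuple
from PIL import Image

# Assumed interfaces to off-the-shelf models; names/signatures are illustrative.
VlmTableFn = Callable[[Image.Image], str]                    # chart image -> table text
AgentStepFn = Callable[[Image.Image, str], Tuple[str, str]]  # (view, table) -> (action, payload)


def zoom(chart: Image.Image, box: Tuple[int, int, int, int], scale: int = 2) -> Image.Image:
    """Crop a chart region and upsample it for fine-grained inspection."""
    region = chart.crop(box)
    return region.resize((region.width * scale, region.height * scale))


def chart_agent(chart: Image.Image,
                vlm_generate_table: VlmTableFn,
                agent_step: AgentStepFn,
                max_iters: int = 5) -> str:
    # Stage 1: initial table from the chart-to-table pretrained VLM.
    table = vlm_generate_table(chart)

    # Stage 2: ReAct-style correction loop. At each step the agent either
    # revises the table, requests a zoomed view of a region, or terminates.
    view = chart
    for _ in range(max_iters):
        action, payload = agent_step(view, table)
        if action == "revise":
            table = payload                      # corrected table text
        elif action == "zoom":
            box = tuple(int(v) for v in payload.split(","))
            view = zoom(chart, box)              # inspect a densely packed area
        elif action == "finish":
            break
    return table
```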
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation; cross-modal pretraining; image text matching; cross-modal content generation; vision question answering; cross-modal application; cross-modal information extraction; multimodality
Languages Studied: English
Submission Number: 4034