Abstract: Extracting structured tables from chart images is a challenging task that underpins numerous downstream document analysis applications. While previous studies have demonstrated that multimodal large language models (MLLMs) and vision-language models (VLMs) can convert charts into tables, these models frequently fail to adhere to strict formatting standards, omit fine-grained labels, or introduce numerical inaccuracies. In this work, we introduce ChartAgent, a plug-and-play, agent-based framework that augments any off-the-shelf VLM through a two-stage agentic pipeline. In the first stage, a chart-to-table pretrained VLM generates an initial table directly from the chart image. In the second stage, a ReAct-style LLM agent iteratively corrects the generated table by cross-checking table entries against the corresponding visual regions. This agent can optionally invoke a novel zooming tool designed for detailed and precise inspection of complex, densely packed chart areas. To evaluate the effectiveness of ChartAgent, we benchmark its performance on the ChartQA dataset against state-of-the-art methods. Our experiments demonstrate consistent improvements over both VLM-only and single-pass correction baselines across structural and numerical metrics. The modular design of ChartAgent enables seamless integration with any VLM without requiring additional fine-tuning. This approach significantly enhances header alignment, numerical fidelity, and overall table quality, providing a robust and efficient solution for accurate chart-to-table extraction.
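As a concrete illustration of the two-stage pipeline summarized above, the sketch below shows one possible shape of such a system in Python. It is hypothetical and not the authors' implementation: the chart-to-table VLM and the ReAct-style agent are abstracted behind callables, and all names (`chart_agent`, `vlm_generate_table`, `agent_step`, `zoom`) are illustrative assumptions rather than the actual ChartAgent API.

```python
# Minimal sketch (hypothetical, not the authors' code) of a two-stage
# chart-to-table pipeline: (1) a chart-to-table VLM drafts an initial table,
# (2) a ReAct-style agent loop cross-checks the table against the chart,
# optionally zooming into dense regions before finishing.
from typing import Callable, Tuple
from PIL import Image

# Assumed interfaces to off-the-shelf models; names/signatures are illustrative.
VlmTableFn = Callable[[Image.Image], str]                    # chart image -> table text
AgentStepFn = Callable[[Image.Image, str], Tuple[str, str]]  # (view, table) -> (action, payload)


def zoom(chart: Image.Image, box: Tuple[int, int, int, int], scale: int = 2) -> Image.Image:
    """Crop a chart region and upsample it for fine-grained inspection."""
    region = chart.crop(box)
    return region.resize((region.width * scale, region.height * scale))


def chart_agent(chart: Image.Image,
                vlm_generate_table: VlmTableFn,
                agent_step: AgentStepFn,
                max_iters: int = 5) -> str:
    # Stage 1: initial table from the chart-to-table pretrained VLM.
    table = vlm_generate_table(chart)

    # Stage 2: ReAct-style correction loop. At each step the agent either
    # revises the table, requests a zoomed view of a region, or terminates.
    view = chart
    for _ in range(max_iters):
        action, payload = agent_step(view, table)
        if action == "revise":
            table = payload                      # corrected table text
        elif action == "zoom":
            box = tuple(int(v) for v in payload.split(","))
            view = zoom(chart, box)              # inspect a densely packed area
        elif action == "finish":
            break
    return table
```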
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation; cross-modal pretraining; image text matching; cross-modal content generation; vision question answering; cross-modal application; cross-modal information extraction; multimodality
Languages Studied: English
Submission Number: 4034