TL;DR: ChartAgent is a plug‑and‑play, agent‑based framework with a two‑stage pipeline: first a chart‑to‑table pretrained VLM generates an initial table from a chart image, then a ReAct LLM‑based agent iteratively corrects it, optionally using a novel zooming tool for fine‑grained inspection. Evaluated on ChartQA, ChartAgent consistently outperforms VLM‑only and single‑pass correction baselines in header alignment, numerical fidelity, and overall table quality, all without any additional fine‑tuning.
Abstract: Extracting structured tables from chart images is a challenging task that underpins numerous downstream document analysis applications. While previous studies have demonstrated that multimodal large language models (MLLMs) and vision-language models (VLMs) can convert charts into tables, these models frequently fail to adhere to strict formatting standards, omit fine-grained labels, or introduce numerical inaccuracies. In this work, we introduce ChartAgent, a plug-and-play, agent-based framework that augments any off-the-shelf VLM through a two-stage agentic pipeline. In the first stage, a chart-to-table pretrained VLM generates an initial table directly from the chart image. In the second stage, a ReAct LLM-based agent iteratively corrects the generated table by cross-verifying visual regions and textual entries. This agent can optionally use a novel zooming tool designed for detailed and precise inspection of complex, densely packed chart areas. To evaluate the effectiveness of ChartAgent, we benchmark its performance on the ChartQA dataset against state-of-the-art methods. Our experiments demonstrate consistent improvements over both VLM-only and single-pass correction baselines across structural and numerical metrics. The modular design of ChartAgent enables seamless integration with any VLM without requiring additional fine-tuning. This approach significantly enhances header alignment, numerical fidelity, and overall table quality, providing a robust and efficient solution for accurate chart-to-table extraction.
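For concreteness, below is a minimal sketch of the two-stage pipeline described in the abstract, assuming the chart-to-table VLM and the ReAct agent are exposed as injected callables and that the zooming tool is a simple image crop. The function names, action schema, and stopping criterion are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a ChartAgent-style two-stage pipeline (illustrative only).
# `vlm_to_table` and `react_step` are hypothetical callables standing in for
# the chart-to-table VLM and the ReAct LLM agent; their interfaces are assumed.
from typing import Callable
from PIL import Image


def zoom(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Zooming tool: crop a dense chart region for fine-grained inspection."""
    return image.crop(box)


def chart_agent(
    chart_path: str,
    vlm_to_table: Callable[[Image.Image], str],      # Stage 1: chart-to-table VLM
    react_step: Callable[[Image.Image, str], dict],  # Stage 2: one ReAct step
    max_iters: int = 5,
) -> str:
    image = Image.open(chart_path)

    # Stage 1: initial table generated directly from the chart image.
    table = vlm_to_table(image)

    # Stage 2: ReAct agent cross-verifies visual regions against table entries
    # and iteratively corrects the table, optionally zooming into dense areas.
    view = image                                     # agent starts from the full chart
    for _ in range(max_iters):
        step = react_step(view, table)               # e.g. {"action": "zoom", "box": (...)}
        if step["action"] == "zoom":                 # inspect a region more closely
            view = zoom(image, step["box"])
        elif step["action"] == "edit":               # apply the agent's corrected table
            table, view = step["table"], image
        else:                                        # "stop": table judged consistent
            break
    return table
```

Injecting the VLM and agent as callables mirrors the plug-and-play claim: any off-the-shelf model can be dropped in without fine-tuning, since the correction loop only depends on the table text and the chart image.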
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation; cross-modal pretraining; image text matching; cross-modal content generation; vision question answering; cross-modal application; cross-modal information extraction; multimodality
Languages Studied: English
Previous URL: https://openreview.net/forum?id=SyHGjderZ8
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Software: zip
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 4
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: N/A
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4
C3 Descriptive Statistics: Yes
C3 Elaboration: 4
C4 Parameters For Packages: N/A
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: AI assistance was used only for grammar and vocabulary correction.
Author Submission Checklist: yes
Submission Number: 1370