Abstract: Chart parsing poses a significant challenge due to the diversity of chart styles, values, texts, and so forth. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. Like popular LVLMs, OneChart incorporates an autoregressive main body. Uniquely, to enhance the reliability of the numerical parts of the output, we introduce an auxiliary token placed at the beginning of the token sequence, along with an additional decoder. The numerically optimized auxiliary token allows subsequent chart-parsing tokens to capture enhanced numerical features through causal attention. Furthermore, with the aid of the auxiliary token, we devise a self-evaluation mechanism that enables the model to gauge the reliability of its chart parsing results by assigning confidence scores to the generated content. OneChart significantly outperforms current state-of-the-art (SOTA) chart parsing models, e.g., DePlot, ChartVLM, and ChartAst, in Average Precision (AP) for chart structural extraction across multiple public benchmarks, despite having only 0.2 billion parameters. Moreover, as a chart parsing agent, it also brings accuracy gains of over 10% for the popular LVLM LLaVA-1.6 on the downstream ChartQA benchmark.
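The following is a minimal, illustrative sketch of the auxiliary-token idea described above, not the authors' implementation: a learnable token is prepended to the decoder input so that all subsequent tokens can attend to it under causal attention, and a small additional decoder regresses the chart's numbers from that token's hidden state. The module names, dimensions, number of predicted values, and the confidence heuristic are all assumptions.

```python
# Illustrative sketch only; not OneChart's actual code.
import torch
import torch.nn as nn

class AuxNumberHead(nn.Module):
    def __init__(self, hidden_dim=1024, max_values=128):
        super().__init__()
        # Learnable auxiliary token embedding prepended to the decoder input sequence.
        self.aux_token = nn.Parameter(torch.randn(1, 1, hidden_dim))
        # Additional decoder: an MLP mapping the auxiliary token's hidden state to numbers.
        self.number_decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, max_values),
        )

    def prepend(self, token_embeds):
        # token_embeds: (batch, seq_len, hidden_dim) -> (batch, seq_len + 1, hidden_dim)
        batch = token_embeds.size(0)
        return torch.cat([self.aux_token.expand(batch, -1, -1), token_embeds], dim=1)

    def decode_numbers(self, hidden_states):
        # hidden_states: (batch, seq_len + 1, hidden_dim); position 0 is the auxiliary token.
        return self.number_decoder(hidden_states[:, 0])

def confidence_score(regressed_numbers, parsed_numbers):
    # One plausible self-evaluation heuristic (an assumption, not the paper's formula):
    # compare numbers regressed from the auxiliary token with numbers parsed from the
    # generated text, and map the mean relative error to a score in (0, 1].
    rel_err = (regressed_numbers - parsed_numbers).abs() / (parsed_numbers.abs() + 1e-6)
    return torch.exp(-rel_err.mean()).item()
```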
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language, [Generation] Multimedia Foundation Models
Relevance To Conference: Multimodal reasoning that involves interpreting plots and charts is a critical capability. However, even advanced large vision-language models with billions of parameters struggle to handle such tasks satisfactorily due to the diversity of chart styles, values, texts, and so forth. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. The method, coupled with the introduction of an auxiliary token and a customized L1 loss, significantly reduces the ambiguity of language supervision and enhances the reliability of numerical outputs. Moreover, by developing a new benchmark, ChartSE, and a standardized chart-to-dict task, this work addresses the shortcomings of existing QA-style benchmarks, which fail to adequately assess a model's comprehension of chart information. A minimal example of such a dictionary output is sketched below. OneChart not only demonstrates a substantial average improvement of over 20% in information extraction accuracy across various chart types compared to state-of-the-art models, but also yields significant gains on the downstream ChartQA benchmark. Consequently, this work sets a new standard for the automated interpretation of visual data, marking a pivotal step forward in the multimodal processing domain.
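For illustration, a chart-to-dict output might resemble the structure below; the key names and values are hypothetical placeholders rather than the actual ChartSE schema.

```python
# Hypothetical example of a structured chart-to-dict extraction result.
chart_dict = {
    "title": "Quarterly revenue by region",
    "x_axis": "Quarter",
    "y_axis": "Revenue (USD millions)",
    "values": {
        "North": {"Q1": 12.4, "Q2": 15.1, "Q3": 14.8, "Q4": 17.3},
        "South": {"Q1": 9.2, "Q2": 10.6, "Q3": 11.0, "Q4": 12.9},
    },
}
```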
Supplementary Material: zip
Submission Number: 2844