Chart2Code53: A Large-Scale Diverse and Complex Dataset for Enhancing Chart-to-Code Generation

ACL ARR 2025 May Submission6843 Authors

20 May 2025 (modified: 03 Jul 2025)ACL ARR 2025 May SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Chart2code has recently received significant attention in the multimodal community due to its potential to reduce the burden of visualization and promote a more detailed understanding of charts. However, existing Chart2code-related training datasets suffer from at least one of the following issues: (1) limited scale, (2) limited type coverage, and (3) inadequate complexity. To address these challenges, we seek more diverse sources that better align with real-world user distributions and propose dual data synthesis pipelines: (1) synthesize based on online plotting code. (2) synthesize based on chart images in the academic paper. We create a large-scale Chart2code training dataset Chart2code53, including 53 chart types, 130K Chart-code pairs based on the pipeline. Experimental results demonstrate that even with few parameters, the model finetuned on Chart2code53 achieves state-of-the-art performance on multiple Chart2code benchmarks within open-source models.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 6843
Loading