CoDePlot: Evaluating the Chart Code Generation Capabilities of Large Vision Language Models on Realistic Charts

ACL ARR 2025 February Submission 2041 Authors

14 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large vision language models (VLMs) are increasingly used to solve tasks involving non-natural images such as charts, figures, and diagrams. While VLMs often exhibit impressive capabilities in processing these images, their evaluation lags behind: although non-natural images play a significant role in many real-world applications, the vast majority of current benchmarks still focus on natural images. We take a step toward closing this gap by introducing the CoDePlot benchmark, a challenging, novel, and realistic dataset of 3k (chart, code) pairs obtained via heavy VLM-based filtering of permissively licensed Python notebooks from GitHub. Along with our benchmark, we introduce a fine-grained rating system that compares two charts along several aspects (e.g., style and faithfulness) and allows VLMs-as-a-judge to achieve high correlation with human raters. Using this system, we find that chart code generation is hard even for the highest-performing VLMs, with Gemini 2.0 Flash scoring 82.6% and the best open-weight model lagging behind at 49.9% on the hard benchmark examples. Finally, we introduce a training method that frames chart code generation as Inverse Rendering to improve VLMs on CoDePlot. Using Inverse Rendering Training, a small PaliGemma-3B model scores 57.8%, better than its substantially larger counterparts.
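To make the fine-grained rating system concrete, here is a minimal sketch of a pairwise VLM-as-a-judge protocol of the kind the abstract describes. It is not the paper's released code: the `query_judge_vlm` callable, the specific aspect names, and the 1-5 rating scale are illustrative assumptions.

```python
# Hypothetical sketch of pairwise VLM-as-a-judge chart rating.
# `query_judge_vlm` stands in for any multimodal model API (e.g., a wrapper
# around a hosted VLM) and is an assumption, not the paper's implementation.

ASPECTS = ["style", "faithfulness"]  # fine-grained rating dimensions

def rate_chart_pair(reference_png: bytes, generated_png: bytes,
                    query_judge_vlm) -> dict[str, int]:
    """Return one 1-5 rating per aspect for the generated chart."""
    ratings = {}
    for aspect in ASPECTS:
        prompt = (
            f"Compare the two charts with respect to {aspect}. "
            "Rate how well the second chart matches the first on a 1-5 scale. "
            "Answer with a single digit."
        )
        # The judge VLM sees both images and the aspect-specific instruction.
        answer = query_judge_vlm(prompt, images=[reference_png, generated_png])
        ratings[aspect] = int(answer.strip()[0])
    return ratings
```

Rating each aspect in a separate query, rather than asking for one overall score, is one plausible way such a system could achieve the per-aspect granularity the abstract reports correlating with human raters.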
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: multimodality, benchmarking, evaluation methodologies, evaluation, metrics
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2041