Submission Type: Regular Short Paper
Submission Track: Resources and Evaluation
Keywords: text generation, scientific figure caption, caption evaluation
Abstract: There is growing interest in systems that generate captions for scientific figures.
However, assessing these systems' output poses a significant challenge.
Human evaluation requires academic expertise and is costly, while automatic evaluation depends on often low-quality author-written captions.
This paper investigates using large language models (LLMs) as a cost-effective, reference-free method for evaluating figure captions.
We first constructed SCICAP-EVAL, a human evaluation dataset containing human judgments for 3,600 scientific figure captions, both original and machine-generated, for 600 arXiv figures.
We then prompted LLMs such as GPT-4 and GPT-3 to score each caption from 1 to 6 based on its potential to aid reader understanding, given relevant context such as figure-mentioning paragraphs.
Results show that GPT-4, used as a zero-shot evaluator, outperformed all other models and even surpassed assessments made by computer science undergraduates, achieving a Kendall correlation of 0.401 with Ph.D. students' rankings.
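A minimal sketch of this zero-shot scoring setup, assuming the OpenAI Python SDK and SciPy. The prompt wording, model identifier, and the example captions, paragraphs, and expert scores below are hypothetical illustrations, not the paper's actual prompt or data:

```python
from openai import OpenAI
from scipy.stats import kendalltau

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_caption(caption: str, mention_paragraphs: str) -> int:
    """Zero-shot scoring: ask the model to rate a caption from 1 to 6."""
    prompt = (
        "You are assessing the caption of a scientific figure.\n\n"
        f"Paragraphs in the paper that mention the figure:\n{mention_paragraphs}\n\n"
        f"Caption:\n{caption}\n\n"
        "On a scale from 1 (not helpful) to 6 (very helpful), how much would "
        "this caption help a reader understand the figure? "
        "Reply with a single integer."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    return int(response.choices[0].message.content.strip())

# Hypothetical usage: measure agreement between LLM scores and expert
# judgments via Kendall's tau. Real inputs would come from SCICAP-EVAL.
captions = ["Accuracy vs. epoch for all models.", "Results.", "BLEU by dataset size."]
paragraphs = ["Figure 2 plots accuracy...", "Figure 5 shows...", "Figure 7 reports BLEU..."]
phd_scores = [5, 1, 4]

llm_scores = [score_caption(c, p) for c, p in zip(captions, paragraphs)]
tau, _ = kendalltau(llm_scores, phd_scores)
print(f"Kendall tau vs. expert scores: {tau:.3f}")
```

Keeping `temperature=0` makes repeated runs of the evaluator return stable scores, which matters when the scores feed a rank correlation against human judgments.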
Submission Number: 4570