AI-Figures: A Fine-grained Task-oriented Dataset for Multimodal Scientific Literature Understanding

ACL ARR 2026 January Submission10146 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multimodal Scholarly Data, Large Vision Language Model, YOLO
Abstract: Diagrams and figures are a powerful medium of communication in scientific research. There has been a recent surge of interest in developing machine-learning-driven applications involving scientific figures, such as multimodal question answering, multimodal document retrieval, text-to-image generation, and image captioning. Challenging tasks in this domain may depend on only a specific category of scientific figures, yet no dataset in the prior literature provides a domain-specific, broad classification of scientific figures. To fill this gap, we introduce AI-Figures, a large-scale dataset of scientific figure-caption pairs classified into 8 categories. We create this dataset by leveraging image segmentation and classification with the YOLO model. Our automated data acquisition pipeline can also be applied to other datasets to classify their figures. We benchmark our dataset on various tasks, including figure captioning, text-to-figure generation, scholarly multimodal question answering, and multimodal document retrieval, using various vision-based models. We show a significant increase in a model's inference capabilities when we fine-tune it on targeted classes of our dataset. Our dataset and code will be made public upon acceptance.
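The classify-then-bucket acquisition pipeline summarized in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the category names and the `classify_figure` stub are placeholder assumptions (the abstract does not list the 8 actual classes), and a real pipeline would run a trained YOLO detector over page images.

```python
from dataclasses import dataclass

# Hypothetical taxonomy; the paper's actual 8 classes are not given
# in the abstract, so these names are illustrative placeholders.
CATEGORIES = [
    "line_plot", "bar_chart", "architecture_diagram", "table",
    "scatter_plot", "flowchart", "heatmap", "other",
]

@dataclass
class FigureRecord:
    image_path: str
    caption: str
    category: str

def classify_figure(image_path: str) -> str:
    """Stub standing in for YOLO-based figure classification.

    A real pipeline would load trained weights, e.g. (hypothetical):
        from ultralytics import YOLO
        model = YOLO("figure-classifier.pt")
        return CATEGORIES[int(model(image_path)[0].probs.top1)]
    """
    return "other"

def build_dataset(pairs):
    """Group (image_path, caption) pairs into per-category buckets."""
    buckets = {c: [] for c in CATEGORIES}
    for image_path, caption in pairs:
        cat = classify_figure(image_path)
        buckets[cat].append(FigureRecord(image_path, caption, cat))
    return buckets
```

With a trained classifier substituted for the stub, each bucket becomes a targeted subset of the kind used for the per-class fine-tuning experiments described above.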
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Multimodality, Corpus creation, Benchmarking
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 10146