AI-Figures: A Fine-grained Task-oriented Dataset for Multimodal Scientific Literature Understanding

ACL ARR 2026 January Submission10146 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Multimodal Scholarly Data, Large Vision Language Model, YOLO
Abstract: Diagrams and figures are a powerful medium of communication in scientific research. There has been a recent surge of interest in developing machine-learning-driven applications involving scientific figures, such as multimodal question answering, multimodal document retrieval, text-to-image generation, and image captioning. Challenging tasks in this domain may depend on only a specific category of scientific figures, yet no dataset in the prior literature provides a domain-specific, broad classification of scientific figures. To fill this gap, we introduce AI-Figures, a large-scale dataset of scientific figure-caption pairs classified into 8 categories. We create this dataset by leveraging image segmentation and classification with the YOLO model. Our automated data acquisition pipeline can also be applied to other datasets to classify their figures. We benchmark our dataset on various tasks, including figure captioning, text-to-figure generation, scholarly multimodal question answering, and multimodal document retrieval, using various vision-based models. We show a significant increase in a model's inference capabilities when we fine-tune it on targeted classes of our dataset. Our dataset and code will be made public upon acceptance.
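The classify-then-bucket acquisition pipeline summarized in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the category names and the `classify_figure` stub are placeholder assumptions (the abstract does not list the 8 actual classes), and a real pipeline would run a trained YOLO detector over page images.

```python
from dataclasses import dataclass

# Hypothetical taxonomy; the paper's actual 8 classes are not given
# in the abstract, so these names are illustrative placeholders.
CATEGORIES = [
    "line_plot", "bar_chart", "architecture_diagram", "table",
    "scatter_plot", "flowchart", "heatmap", "other",
]

@dataclass
class FigureRecord:
    image_path: str
    caption: str
    category: str

def classify_figure(image_path: str) -> str:
    """Stub standing in for YOLO-based figure classification.

    A real pipeline would load trained weights, e.g. (hypothetical):
        from ultralytics import YOLO
        model = YOLO("figure-classifier.pt")
        return CATEGORIES[int(model(image_path)[0].probs.top1)]
    """
    return "other"

def build_dataset(pairs):
    """Group (image_path, caption) pairs into per-category buckets."""
    buckets = {c: [] for c in CATEGORIES}
    for image_path, caption in pairs:
        cat = classify_figure(image_path)
        buckets[cat].append(FigureRecord(image_path, caption, cat))
    return buckets
```

With a trained classifier substituted for the stub, each bucket becomes a targeted subset of the kind used for the per-class fine-tuning experiments described above.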
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Multimodality, Corpus creation, Benchmarking
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 10146