AI-Figures: A Fine-grained Task-oriented Dataset for developing Multimodal Scientific Literature Understanding

ACL ARR 2025 February Submission7640 Authors

16 Feb 2025 (modified: 09 May 2025) · License: CC BY 4.0
Abstract: Diagrams and figures are a powerful medium of communication in scientific research. There has been a recent surge of interest in developing machine-learning-driven applications involving scientific figures, such as multimodal question answering, multimodal document retrieval, text-to-image generation, and image captioning. Challenging tasks in this domain may depend on only a specific category of scientific figures, yet no dataset in the prior literature provides a domain-specific, broad classification of scientific figures. To fill this gap, we introduce AI-Figures, a large-scale dataset of scientific figure-caption pairs classified into 9 categories. We build the dataset by leveraging image segmentation and classification with the YOLO model, and our automated data-acquisition pipeline can also be applied to other datasets to classify their figures. We benchmark 6 large vision-language models and 5 large language models on our dataset across tasks such as figure captioning, tag classification, text-to-figure generation, multimodal question answering, and multimodal document retrieval, and show that fine-tuning a model on our dataset significantly improves its inference capabilities. Our dataset and code will be released in the final version.
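To make the classification step concrete, below is a minimal sketch of how detector output might be bucketed into figure categories. This is an illustration only: the category names, the `classify_figures` helper, and the `(class_id, confidence, caption)` detection format are all assumptions, not the paper's actual 9 labels or pipeline; a real implementation would consume the output of a trained YOLO model after figure segmentation.

```python
# Illustrative category labels (hypothetical; not the paper's actual 9 classes).
CATEGORIES = [
    "architecture_diagram", "line_plot", "bar_chart", "scatter_plot",
    "heatmap", "flowchart", "table_image", "qualitative_example", "other",
]

def classify_figures(detections, conf_threshold=0.5):
    """Bucket detected figure regions into categories.

    `detections` is a list of (class_id, confidence, caption) tuples, the
    kind of per-region output a YOLO-style detector could emit after
    segmenting figures out of a paper page.
    """
    buckets = {name: [] for name in CATEGORIES}
    for class_id, confidence, caption in detections:
        if confidence < conf_threshold:
            continue  # drop low-confidence detections
        buckets[CATEGORIES[class_id]].append(caption)
    return buckets

# Toy detections: two confident regions and one below the threshold.
dets = [
    (1, 0.92, "Training loss over epochs"),
    (0, 0.88, "Model overview"),
    (4, 0.30, "Noisy region"),
]
result = classify_figures(dets)
```

Here `result["line_plot"]` holds the caption of the first detection, while the low-confidence third region is filtered out, mirroring the automated acquisition pipeline's goal of producing clean category-labeled figure-caption pairs.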
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Multimodal Scholarly Data, Large Vision Language Model, YOLO
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 7640