Multimodal Large Language Models for Visual Segment Classification and Description Generation in Digital Storybooks
Abstract: Multimodal Large Language Models (MLLMs) have shown strong capabilities on complex visual-textual tasks, but their application to narrative-driven contexts remains underexplored. In this work, we evaluate the ability of MLLMs to identify relevant visual segments in illustrated digital storybooks and to generate descriptions for those segments. We curate a dataset of 14,162 segments extracted from 32 Arabic children's digital storybooks using the Segment Anything Model (SAM), with human annotations for segment relevance and descriptive labels. We evaluate five state-of-the-art MLLMs under zero-shot prompting conditions and assess the two best-performing models under few-shot prompting. Our results show that few-shot prompting of GPT-4o achieves the best results for segment relevance classification. While all models struggle with fine-grained contextual reasoning, our findings provide insights for developing AI-powered interactive digital storybooks and help advance multimodal methodologies in narrative understanding tasks.
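For concreteness, the segment-extraction and zero-shot relevance-classification pipeline described above can be approximated as follows. This is a minimal illustrative sketch, not the paper's implementation: the SAM checkpoint filename, the prompt wording, and the classify_segment helper are assumptions, using Meta's segment-anything package and the OpenAI Python SDK.

import base64

import cv2
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load SAM and generate candidate masks for one storybook page.
# The checkpoint filename is an assumption; any SAM checkpoint works.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
page = cv2.cvtColor(cv2.imread("page_01.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(page)  # list of dicts with 'bbox', 'segmentation', ...

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_segment(crop_rgb) -> str:
    """Zero-shot relevance query for one cropped segment (hypothetical prompt)."""
    png = cv2.imencode(".png", cv2.cvtColor(crop_rgb, cv2.COLOR_RGB2BGR))[1].tobytes()
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this image segment a story-relevant element "
                         "(character, object, or scene detail)? "
                         "Answer with exactly one word: relevant or irrelevant."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()


# Crop each SAM mask by its bounding box and classify the crop.
for mask in masks:
    x, y, w, h = map(int, mask["bbox"])
    print(mask["bbox"], classify_segment(page[y:y + h, x:x + w]))

A few-shot variant of this sketch would prepend exemplar (image, label) message pairs to the conversation before the query segment.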
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, cross-modal content generation, cross-modal application, cross-modal information extraction, multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English, Arabic
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We curated our own dataset; its creation and use are described in Sections 3 and 4.
B2 Discuss The License For Artifacts: No
B2 Elaboration: We are not releasing the dataset publicly, so licensing terms are not discussed in the paper.
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3 describes the curated data and the annotation process.
B6 Statistics For Data: Yes
B6 Elaboration: Yes; Section 3 reports dataset statistics and Section 4 reports the calculations.
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: We used proprietary language models (GPT-4o, GPT-4o mini, Claude, Gemini) via API. Model sizes and computational budgets are not publicly available and are not under our control.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Our experimental setup consists of prompt design, discussed in Section 3.3.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4 covers the performance, accuracy, and semantic similarity metrics across results.
C4 Parameters For Packages: Yes
C4 Elaboration: Yes, in Section 3.2 for spaCy.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Yes, in Section 3.2 and the appendix; the tasks posed no risks to participants.
D2 Recruitment And Payment: Yes
D2 Elaboration: Yes, Section 3.2
D3 Data Consent: Yes
D3 Elaboration: Yes, Section 3.2
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Yes; the study was determined exempt.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Yes, only the annotators' languages (bilingual in English and Arabic), as this is the only characteristic relevant to this research; see Section 3.2.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 1022