Multimodal Large Language Models for Visual Segment Classification and Description Generation in Digital Storybooks
Abstract: Multimodal Large Language Models (MLLMs) have shown strong capabilities on complex visual-textual tasks, but their application to narrative-driven contexts remains underexplored. In this work, we evaluate the ability of MLLMs to identify relevant visual segments in illustrated digital storybooks and to generate descriptions for those segments. We curate a dataset of 14,162 segments extracted from 32 Arabic children's digital storybooks using the Segment Anything Model (SAM), with human annotations for segment relevance and descriptive labels. We evaluate five state-of-the-art MLLMs under zero-shot prompting conditions and assess the two best-performing models under few-shot prompting. Our results show that few-shot prompting of GPT-4o achieves the best results for segment relevance classification. While all models struggle with fine-grained contextual reasoning, our findings provide insights for developing AI-powered interactive digital storybooks and help advance multimodal methodologies in narrative understanding tasks.
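For concreteness, the segment-extraction and zero-shot relevance-classification pipeline described above can be approximated as follows. This is a minimal illustrative sketch, not the paper's implementation: the SAM checkpoint filename, the prompt wording, and the classify_segment helper are assumptions, using Meta's segment-anything package and the OpenAI Python SDK.

import base64

import cv2
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load SAM and generate candidate masks for one storybook page.
# The checkpoint filename is an assumption; any SAM checkpoint works.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
page = cv2.cvtColor(cv2.imread("page_01.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(page)  # list of dicts with 'bbox', 'segmentation', ...

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_segment(crop_rgb) -> str:
    """Zero-shot relevance query for one cropped segment (hypothetical prompt)."""
    png = cv2.imencode(".png", cv2.cvtColor(crop_rgb, cv2.COLOR_RGB2BGR))[1].tobytes()
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is this image segment a story-relevant element "
                         "(character, object, or scene detail)? "
                         "Answer with exactly one word: relevant or irrelevant."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()


# Crop each SAM mask by its bounding box and classify the crop.
for mask in masks:
    x, y, w, h = map(int, mask["bbox"])
    print(mask["bbox"], classify_segment(page[y:y + h, x:x + w]))

A few-shot variant of this sketch would prepend exemplar (image, label) message pairs to the conversation before the query segment.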
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: image text matching, cross-modal content generation, cross-modal application, cross-modal information extraction, multimodality
Contribution Types: Model analysis & interpretability
Languages Studied: English, Arabic
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We curated our own dataset; its creation and use are described in Sections 3 and 4.
B2 Discuss The License For Artifacts: No
B2 Elaboration: We are not releasing the dataset publicly, so licensing terms are not discussed in the paper.
B3 Artifact Use Consistent With Intended Use: N/A
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Section 3 describes the curated data and the annotation process.
B6 Statistics For Data: Yes
B6 Elaboration: Yes; Section 3 reports dataset statistics and Section 4 reports the calculations.
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: We used proprietary language models (GPT-4o, GPT-4o mini, Claude, Gemini) via API. Model sizes and computational budgets are not publicly available and are not under our control.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Our experimental setup consists of prompt design, discussed in Section 3.3.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4 covers the performance, accuracy, and semantic similarity metrics across results.
C4 Parameters For Packages: Yes
C4 Elaboration: Yes, in Section 3.2 for spaCy.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Yes, in Section 3.2 and the appendix; the tasks posed no risks to participants.
D2 Recruitment And Payment: Yes
D2 Elaboration: Yes, Section 3.2
D3 Data Consent: Yes
D3 Elaboration: Yes, Section 3.2
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Yes; the study was determined exempt.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Yes, only the annotators' languages (bilingual in English and Arabic), as this is the only characteristic relevant to this research; see Section 3.2.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: Yes
Submission Number: 1022