Towards Multimodal Question Answering in Educational Domain

ACL ARR 2025 July Submission 119 Authors

23 Jul 2025 (modified: 08 Sept 2025), ACL ARR 2025 July Submission, CC BY 4.0

Abstract: The proliferation of educational videos on the Internet has changed the educational landscape by enabling students to learn complex concepts at their own pace. Our work outlines the vision of an automated tutor: a multimodal question answering (QA) system that answers questions from students watching a video. Such a system can make doubt resolution faster and further improve the learning experience. In this work, we take the first steps towards building such a QA system. We curate and release a dataset named EduVidQA, with 3,158 videos and 18,474 QA pairs. However, building and evaluating an educational QA system is challenging because (1) existing evaluation metrics do not correlate with human judgments, and (2) a student question could be answered in many different ways, so training on a single gold answer could confuse the model and degrade its performance. We conclude with important research questions to develop this research area further.

Paper Type: Short

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: vision question answering, video processing, multimodality

Contribution Types: NLP engineering experiment, Data resources

Languages Studied: English

Previous URL: https://openreview.net/forum?id=lT9xK9pfLp

Explanation Of Revisions PDF: pdf

Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).

Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).

Software: zip

Data: zip

A1 Limitations Section: This paper has a limitations section.

A2 Potential Risks: Yes

A2 Elaboration: Section 8

B Use Or Create Scientific Artifacts: Yes

B1 Cite Creators Of Artifacts: Yes

B1 Elaboration: Section 8

B2 Discuss The License For Artifacts: Yes

B2 Elaboration: Section 8

B3 Artifact Use Consistent With Intended Use: Yes

B3 Elaboration: Section 8

B4 Data Contains Personally Identifying Info Or Offensive Content: Yes

B4 Elaboration: Section 8

B5 Documentation Of Artifacts: N/A

B6 Statistics For Data: Yes

B6 Elaboration: Section 3

C Computational Experiments: Yes

C1 Model Size And Budget: Yes

C1 Elaboration: Section 5.1

C2 Experimental Setup And Hyperparameters: Yes

C2 Elaboration: Section 5.1

C3 Descriptive Statistics: Yes

C3 Elaboration: Section 5.2

C4 Parameters For Packages: Yes

C4 Elaboration: Section 4

D Human Subjects Including Annotators: Yes

D1 Instructions Given To Participants: No

D1 Elaboration: Exactly the same as prompts discussed in Appendix A

D2 Recruitment And Payment: N/A

D3 Data Consent: Yes

D3 Elaboration: Section 8

D4 Ethics Review Board Approval: N/A

D5 Characteristics Of Annotators: N/A

E Ai Assistants In Research Or Writing: Yes

E1 Information About Use Of Ai Assistants: Yes

E1 Elaboration: Appendix A

Author Submission Checklist: yes

Submission Number: 119
